一个数据科学框架:可以达到99%的准确度

A Data Science Framework: To Achieve 99% Accuracy

很不错的一篇文章,但是英文苦手,所以就借助谷歌和自己的渣英文简单翻译了,反正自己能看懂,哈哈。

Hello and Welcome to Kaggle, the online Data Science Community to learn, share, and compete. Most beginners get lost in the field, because they fall into the black box approach, using libraries and algorithms they don’t understand. This tutorial will give you a 1-2-year head start over your peers, by providing a framework that teaches you how-to think like a data scientist vs what to think/code. Not only will you be able to submit your first competition, but you’ll be able to solve any problem thrown your way. I provide clear explanations, clean code, and plenty of links to resources. Please Note: This Kernel is still being improved. So check the Change Logs below for updates. Also, please be sure to upvote, fork, and comment and I’ll continue to develop. Thanks, and may you have “statistically significant” luck!

欢迎来到Kaggle,一个用于学习、分享和竞赛的在线数据科学社区。大多数初学者在这个领域迷失了方向,因为他们陷入了黑盒子式的做法,使用自己并不理解的库和算法。本教程通过提供一个框架,教您如何像数据科学家一样思考,而不是单纯告诉您思考什么、写什么代码,从而让您比同行领先1-2年。您不仅能够提交自己的第一场比赛,还能够解决遇到的任何问题。我提供了清晰的解释、干净的代码和大量的资源链接。请注意:这个Kernel仍在持续改进中,请查看下面的更改日志以获取更新。另外,请务必点赞(upvote)、fork和评论,我会继续完善。谢谢,愿你拥有"统计学上显著"的好运!

Table of Contents

Chapter 1 - How a Data Scientist Beat the Odds
Chapter 2 - A Data Science Framework
Chapter 3 - Step 1: Define the Problem and Step 2: Gather the Data
Chapter 4 - Step 3: Prepare Data for Consumption
Chapter 5 - The 4 C’s of Data Cleaning: Correcting, Completing, Creating, and Converting
Chapter 6 - Step 4: Perform Exploratory Analysis with Statistics
Chapter 7 - Step 5: Model Data
Chapter 8 - Evaluate Model Performance
Chapter 9 - Tune Model with Hyper-Parameters
Chapter 10 - Tune Model with Feature Selection
Chapter 11 - Step 6: Validate and Implement
Chapter 12 - Conclusion and Step 7: Optimize and Strategize
Change Log

目录

  1. 第1章 - 数据科学家如何击败概率
  2. 第2章 - 数据科学框架
  3. 第3章 - 第1步:定义问题和第2步:收集数据
  4. 第4章 - 第3步:为使用准备数据
  5. 第5章 - 数据清洗的4个C:修正、补全、创建和转换
  6. 第6章 - 第4步:使用统计进行探索性分析
  7. 第7章 - 第5步:建模数据
  8. 第8章 - 评估模型性能
  9. 第9章 - 使用超参数调整模型
  10. 第10章 - 使用特征选择调整模型
  11. 第11章 - 第6步:验证和实现
  12. 第12章 - 总结和第7步:优化与制定策略

Credits
How-to Use this Tutorial: Read the explanations provided in this Kernel and the links to developer documentation. The goal is to not just learn the whats, but the whys. If you don’t understand something in the code the print() function is your best friend. In coding, it’s okay to try, fail, and try again. If you do run into problems, Google is your second best friend, because 99.99% of the time, someone else had the same question/problem and already asked the coding community. If you’ve exhausted all your resources, the Kaggle Community via forums and comments can help too.

关于:
如何使用本教程:阅读这个Kernel中提供的解释以及指向开发者文档的链接。目标不只是学习"是什么",还要弄清"为什么"。如果你看不懂代码中的某些东西,print()函数就是你最好的朋友(下面有一个小例子)。在编程中,尝试、失败、再尝试都是正常的。如果你确实遇到了问题,Google是你的第二好朋友,因为99.99%的情况下,别人已经遇到过同样的问题,并且已经在编程社区提问过了。如果你已经用尽了所有资源,Kaggle社区的论坛和评论也能提供帮助。
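For example, here is a tiny sketch of that advice (purely illustrative and not part of the original kernel; the variable name is made up):

#hypothetical example: use print() to inspect an object you don't understand
import pandas as pd
mystery = pd.Series([3, 1, 4, 1, 5], name='digits') #pretend this came from someone else's code
print(type(mystery))   #what kind of object is it?
print(mystery.shape)   #how big is it?
print(mystery.head())  #what does it actually contain?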

How a Data Scientist Beat the Odds

It’s the classical problem, predict the outcome of a binary event. In laymen terms this means, it either occurred or did not occur. For example, you won or did not win, you passed the test or did not pass the test, you were accepted or not accepted, and you get the point. A common business application is churn or customer retention. Another popular use case is, healthcare’s mortality rate or survival analysis. Binary events create an interesting dynamic, because we know statistically, a random guess should achieve a 50% accuracy rate, without creating one single algorithm or writing one single line of code. However, just like autocorrect spellcheck technology, sometimes we humans can be too smart for our own good and actually underperform a coin flip. In this kernel, I use Kaggle’s Getting Started Competition, Titanic: Machine Learning from Disaster, to walk the reader through, how-to use the data science framework to beat the odds.

What happens when technology is too smart for its own good?

数据科学家如何击败概率

这是一个经典问题:预测二元事件的结果。通俗地说,就是事件要么发生了,要么没有发生。例如,你赢了或者没有赢,你通过了考试或者没有通过,你被录取或者没被录取,诸如此类。常见的商业应用是客户流失或客户留存,另一个流行的用例是医疗领域的死亡率或生存分析。二元事件带来一个有趣的现象:从统计上我们知道,随机猜测就应该能达到50%的准确率,而无需创建任何算法或编写任何一行代码。但是,就像自动纠错的拼写检查技术一样,有时我们人类会聪明反被聪明误,实际表现还不如抛硬币。在这个Kernel中,我使用Kaggle的入门比赛——泰坦尼克号:从灾难中学习机器学习(Titanic: Machine Learning from Disaster),来引导读者学习如何使用数据科学框架击败概率。
当技术比我们更聪明时,会发生什么?

Funny Autocorrect
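To make the "a random guess should achieve about 50% accuracy" claim above concrete, here is a quick simulation sketch (an illustration only, not part of the original kernel), assuming a balanced binary outcome:

#sketch: random guessing on a balanced binary outcome lands near 50% accuracy
import numpy as np
rng = np.random.RandomState(0)
truth = rng.randint(0, 2, size=10000)  #simulated binary outcomes
guess = rng.randint(0, 2, size=10000)  #coin-flip predictions
print("Random-guess accuracy: {:.1%}".format((truth == guess).mean()))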

A Data Science Framework

Define the Problem: If data science, big data, machine learning, predictive analytics, business intelligence, or any other buzzword is the solution, then what is the problem? As the saying goes, don’t put the cart before the horse. Problems before requirements, requirements before solutions, solutions before design, and design before technology. Too often we are quick to jump on the new shiny technology, tool, or algorithm before determining the actual problem we are trying to solve.

Gather the Data: John Naisbitt wrote in his 1984 (yes, 1984) book Megatrends, we are “drowning in data, yet starving for knowledge.” So, chances are, the dataset(s) already exist somewhere, in some format. It may be external or internal, structured or unstructured, static or streamed, objective or subjective, etc. As the saying goes, you don’t have to reinvent the wheel, you just have to know where to find it. In the next step, we will worry about transforming “dirty data” to “clean data.”

Prepare Data for Consumption: This step is often referred to as data wrangling, a required process to turn “wild” data into “manageable” data. Data wrangling includes implementing data architectures for storage and processing, developing data governance standards for quality and control, data extraction (i.e. ETL and web scraping), and data cleaning to identify aberrant, missing, or outlier data points.

Perform Exploratory Analysis: Anybody who has ever worked with data knows, garbage-in, garbage-out (GIGO). Therefore, it is important to deploy descriptive and graphical statistics to look for potential problems, patterns, classifications, correlations and comparisons in the dataset. In addition, data categorization (i.e. qualitative vs quantitative) is also important to understand and select the correct hypothesis test or data model.

Model Data: Like descriptive and inferential statistics, data modeling can either summarize the data or predict future outcomes. Your dataset and expected results, will determine the algorithms available for use. It’s important to remember, algorithms are tools and not magical wands or silver bullets. You must still be the master craft (wo)man that knows how-to select the right tool for the job. An analogy would be asking someone to hand you a Philip screwdriver, and they hand you a flathead screwdriver or worst a hammer. At best, it shows a complete lack of understanding. At worst, it makes completing the project impossible. The same is true in data modelling. The wrong model can lead to poor performance at best and the wrong conclusion (that’s used as actionable intelligence) at worst.

Validate and Implement Data Model: After you’ve trained your model based on a subset of your data, it’s time to test your model. This helps ensure you haven’t overfit your model or made it so specific to the selected subset, that it does not accurately fit another subset from the same dataset. In this step we determine if our model overfit, generalize, or underfit our dataset.

Optimize and Strategize: This is the “bionic man” step, where you iterate back through the process to make it better…stronger…faster than it was before. As a data scientist, your strategy should be to outsource developer operations and application plumbing, so you have more time to focus on recommendations and design. Once you’re able to package your ideas, this becomes your “currency exchange” rate.

数据科学框架

定义问题:如果数据科学、大数据、机器学习、预测分析、商业智能或任何其他流行词是解决方案,那么问题是什么?俗话说,不要把马车放在马前面(译者注:即不要本末倒置)。先有问题再有需求,先有需求再有解决方案,先有解决方案再有设计,先有设计再有技术。我们常常在确定真正要解决的问题之前,就急着用上闪闪发光的新技术、新工具或新算法(译者注:原文shiny,作者调皮)。

收集数据:约翰·奈斯比特(John Naisbitt)在他1984年(是的,1984年)出版的《大趋势》(Megatrends)一书中写道,我们正"淹没在数据之中,却渴求着知识"。所以,你需要的数据集很可能已经以某种格式存在于某个地方,它可能是外部的或内部的、结构化的或非结构化的、静态的或流式的、客观的或主观的,等等。俗话说,你不必重新发明轮子,只需要知道去哪里找到它。到下一步,我们再操心如何把"脏数据"变成"干净数据"。

为使用准备数据:这一步通常被称为数据整理(data wrangling),是把"野生"数据变成"可管理"数据的必经过程。数据整理包括实施用于存储和处理的数据架构,制定保证质量和控制的数据治理标准,数据提取(如ETL和网页抓取),以及数据清洗以识别异常、缺失或离群的数据点。

进行探索性分析:任何处理过数据的人都知道:垃圾进,垃圾出(GIGO)。因此,运用描述性统计和图形化统计来寻找数据集中潜在的问题、模式、分类、相关性和可比关系非常重要。此外,数据分类(即定性与定量)对于理解和选择正确的假设检验或数据模型也很重要。

建模数据:与描述性统计和推断性统计类似,数据建模既可以总结数据,也可以预测未来结果。你的数据集和预期结果将决定可供使用的算法。重要的是要记住,算法只是工具,而不是魔杖或银弹(译者注:在西方传说中银弹是克制狼人等怪物的武器)。你仍然必须是那个知道如何为任务挑选正确工具的工匠。打个比方:你让别人递给你一把十字(Philips)螺丝刀,他们却递给你一把一字螺丝刀,甚至更糟,递给你一把锤子。往好了说,这表明对方完全没有理解;往坏了说,这会让项目无法完成。数据建模也是如此:错误的模型轻则导致糟糕的性能,重则得出错误的结论(而这些结论还会被当作可操作的情报使用)。

验证和实现数据模型:在用数据的一个子集训练好模型之后,就该测试模型了。这有助于确保你的模型没有过拟合,也没有过分针对所选子集,以至于无法准确适配来自同一数据集的另一个子集。在这一步中,我们确定模型对数据集是过拟合、泛化良好,还是欠拟合。

优化和制定策略:这是"仿生人"式的一步,你要在整个流程中反复迭代,让它比以前更好…更强…更快。作为一名数据科学家,你的策略应该是把开发运维和应用程序的底层管道工作外包出去,这样你就有更多时间专注于建议和设计。一旦你能够把自己的想法打包输出,这就成了你的"汇率"。

Step 1: Define the Problem

For this project, the problem statement is given to us on a golden platter: develop an algorithm to predict the survival outcome of passengers on the Titanic.

第一步:定义问题

对于这个项目,问题陈述已经像放在金盘子上一样直接递到了我们面前:开发一个算法来预测泰坦尼克号上乘客的生存结果。

关于 golden platter:直译是"金盘子"。英文习语 handed to us on a golden/silver platter 的意思是"现成奉上、不费吹灰之力就得到"。结合上下文,指的是问题陈述已经直接给出,无需我们自己去定义。

……

Project Summary: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

项目简介:泰坦尼克号的沉没是历史上最臭名昭著的海难之一。1912年4月15日,在首航期间,泰坦尼克号撞上一座冰山后沉没,2224名乘客和船员中有1502人遇难。这场耸人听闻的悲剧震惊了国际社会,并推动了更完善的船舶安全条例。

海难造成如此重大生命损失的原因之一,是没有足够的救生艇供乘客和船员使用。虽然能否在沉船中幸存有一定的运气成分,但有些人群比其他人更有可能生存下来,比如妇女、儿童和上层阶级。

在这个挑战中,我们要求你完成对哪类人更可能生存的分析。特别是,我们要求你运用机器学习工具来预测哪些乘客在这场悲剧中幸存了下来。

Practice Skills

Binary classification
Python and R basics

实践技能

  • 二元分类
  • Python和R基础知识

Step 2: Gather the Data

The dataset is also given to us on a golden platter with test and train data at Kaggle’s Titanic: Machine Learning from Disaster

第二步:收集数据

数据集同样是现成提供的,测试和训练数据可以在Kaggle的Titanic: Machine Learning from Disaster页面获取

Step 3: Prepare Data for Consumption

Since step 2 was provided to us on a golden platter, so is step 3. Therefore, normal processes in data wrangling, such as data architecture, governance, and extraction are out of scope. Thus, only data cleaning is in scope.

第三步:为使用准备数据

由于第2步的数据是现成提供给我们的(译者注:指Kaggle已经给好了数据,无须我们自己收集),第3步也是如此。因此,数据整理中的常规流程,例如数据架构、治理和提取,都不在本文范围之内;只有数据清洗在讨论范围内。

3.1 Import Libraries

The following code is written in Python 3.x. Libraries provide pre-written functionality to perform necessary tasks. The idea is why write ten lines of code, when you can write one line.

3.1导入库

下面的代码是用Python 3.x编写的。库提供了预先写好的功能来执行必要的任务。其理念是:能用一行代码解决的事,为什么要写十行?

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
#load packages
import sys #access to system parameters https://docs.python.org/3/library/sys.html
print("Python version: {}". format(sys.version))
import pandas as pd #collection of functions for data processing and analysis modeled after R dataframes with SQL like features
print("pandas version: {}". format(pd.__version__))
import matplotlib #collection of functions for scientific and publication-ready visualization
print("matplotlib version: {}". format(matplotlib.__version__))
import numpy as np #foundational package for scientific computing
print("NumPy version: {}". format(np.__version__))
import scipy as sp #collection of functions for scientific computing and advance mathematics
print("SciPy version: {}". format(sp.__version__))
import IPython
from IPython import display #pretty printing of dataframes in Jupyter notebook
print("IPython version: {}". format(IPython.__version__))
import sklearn #collection of machine learning algorithms
print("scikit-learn version: {}". format(sklearn.__version__))
#misc libraries
import random
import time
#ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
from subprocess import check_output
print(check_output(["powershell.exe"," ls", "../input"]).decode("utf8")) #OS: windows10 use this
# If your computer OS is Linux or macOS, use this one instead
#print(check_output(["ls", "../input"]).decode("utf8"))
# Any results you write to the current directory are saved as output.
Python version: 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)]
pandas version: 0.20.3
matplotlib version: 2.0.2
NumPy version: 1.13.3
SciPy version: 1.0.0
IPython version: 6.1.0
scikit-learn version: 0.19.1
-------------------------


    目录: E:\onedrive\python_codes\Kaggle\Titanic\input


Mode                LastWriteTime         Length Name                                                                  
----                -------------         ------ ----                                                                  
-a----       2017/12/20     16:42           3258 gender_submission.csv                                                 
-a----       2017/12/20     16:42          28629 test.csv                                                              
-a----       2017/12/20     16:42          61194 train.csv                                                             

3.11 Load Data Modelling Libraries

We will use the popular scikit-learn library to develop our machine learning algorithms. In sklearn, algorithms are called Estimators and implemented in their own classes. For data visualization, we will use the matplotlib and seaborn library. Below are common classes to load.

3.11加载数据建模库

我们将使用流行的scikit-learn库来开发我们的机器学习算法。在sklearn中,算法被称为估计器(Estimator),并在各自的类中实现。数据可视化方面,我们将使用matplotlib和seaborn库。以下是常用的待加载类。

#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier
#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.tools.plotting import scatter_matrix
#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

3.2 Meet and Greet Data

This is the meet and greet step. Get to know your data by first name and learn a little bit about it. What does it look like (datatype and values), what makes it tick (independent/feature variables(s)), what’s its goals in life (dependent/target variable(s)). Think of it like a first date, before you jump in and start poking it in the bedroom.

To begin this step, we first import our data. Next we use the info() and sample() function, to get a quick and dirty overview of variable datatypes (i.e. qualitative vs quantitative). Click here for the Source Data Dictionary.

  1. The Survived variable is our outcome or dependent variable. It is a binary nominal datatype of 1 for survived and 0 for did not survive. All other variables are potential predictor or independent variables. It’s important to note, more predictor variables do not make a better model, but the right variables.

  2. The PassengerID and Ticket variables are assumed to be random unique identifiers, that have no impact on the outcome variable. Thus, they will be excluded from analysis.

  3. The Pclass variable is an ordinal datatype for the ticket class, a proxy for socio-economic status (SES), representing 1 = upper class, 2 = middle class, and 3 = lower class.

  4. The Name variable is a nominal datatype. It could be used in feature engineering to derive the gender from title, family size from surname, and SES from titles like doctor or master. Since these variables already exist, we’ll make use of it to see if title, like master, makes a difference.

  5. The Sex and Embarked variables are a nominal datatype. They will be converted to dummy variables for mathematical calculations.

  6. The Age and Fare variable are continuous quantitative datatypes.

  7. The SibSp represents number of related siblings/spouse aboard and Parch represents number of related parents/children aboard. Both are discrete quantitative datatypes. This can be used for feature engineering to create a family size and is alone variable.

  8. The Cabin variable is a nominal datatype that can be used in feature engineering for approximate position on ship when the incident occurred and SES from deck levels. However, since there are many null values, it does not add value and thus is excluded from analysis.

3.2 迎接和问候数据

这是与数据"见面问候"的一步:先叫得出数据的名字,再对它有一点了解。它长什么样(数据类型和取值),是什么驱动着它(自变量/特征变量),它的人生目标是什么(因变量/目标变量)。可以把它想象成第一次约会:先相互了解,再考虑进一步发展。
要开始这一步,我们首先导入数据。接下来使用info()和sample()函数,快速粗略地了解各变量的数据类型(即定性还是定量)。点击这里查看源数据字典。

  1. Survived 变量是我们的结果或因变量。它是二元名义数据类型:1 表示生还,0 表示没有生还。所有其他变量都是潜在的预测变量或自变量。需要注意的是,更多的预测变量并不会带来更好的模型,正确的变量才会。
  2. PassengerID 和 Ticket 变量被假定为随机的唯一标识符,对结果变量没有影响。因此,它们将被排除在分析之外。
  3. Pclass 变量是表示船票等级的序数数据类型,是社会经济地位(SES)的代理变量:1 = 上层,2 = 中层,3 = 下层。
  4. Name 变量是名义数据类型。它可以用于特征工程:从头衔推断性别,从姓氏推断家庭规模,从 doctor 或 master 之类的头衔推断 SES。由于这些信息已经存在于该变量中,我们将利用它来看看头衔(例如 master)是否会带来差异。
  5. Sex 和 Embarked 变量是名义数据类型。它们将被转换为虚拟变量(dummy variable)以便进行数学计算。
  6. Age 和 Fare 变量是连续的定量数据类型。
  7. SibSp 表示船上相关兄弟姐妹/配偶的人数,Parch 表示船上相关父母/子女的人数。两者都是离散的定量数据类型。它们可用于特征工程,创建家庭规模(family size)和是否独自一人(is alone)变量。
  8. Cabin 变量是名义数据类型,可以在特征工程中用来推断事故发生时乘客在船上的大致位置,以及由甲板层级推断 SES。但是,由于空值太多,它并不能增加价值,因此将被排除在分析之外。
#import data from file: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
data_raw = pd.read_csv('../input/train.csv')
#a dataset should be broken into 3 splits: train, test, and (final) validation
#the test file provided is the validation file for competition submission
#we will split the train set into train and test data in future sections
data_val = pd.read_csv('../input/test.csv')
#to play with our data we'll create a copy
#remember python assignment or equal passes by reference vs values, so we use the copy function: https://stackoverflow.com/questions/46327494/python-pandas-dataframe-copydeep-false-vs-copydeep-true-vs
data1 = data_raw.copy(deep = True)
#however passing by reference is convenient, because we can clean both datasets at once
data_cleaner = [data1, data_val]
#preview data
print (data_raw.info()) #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html
#data_raw.head() #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html
#data_raw.tail() #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html
data_raw.sample(10) #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
544 545 0 1 Douglas, Mr. Walter Donald male 50.0 1 0 PC 17761 106.4250 C86 C
354 355 0 3 Yousif, Mr. Wazli male NaN 0 0 2647 7.2250 NaN C
331 332 0 1 Partner, Mr. Austen male 45.5 0 0 113043 28.5000 C124 S
463 464 0 2 Milling, Mr. Jacob Christian male 48.0 0 0 234360 13.0000 NaN S
842 843 1 1 Serepeca, Miss. Augusta female 30.0 0 0 113798 31.0000 NaN C
545 546 0 1 Nicholson, Mr. Arthur Ernest male 64.0 0 0 693 26.0000 NaN S
473 474 1 2 Jerwan, Mrs. Amin S (Marie Marthe Thuillard) female 23.0 0 0 SC/AH Basle 541 13.7917 D C
730 731 1 1 Allen, Miss. Elisabeth Walton female 29.0 0 0 24160 211.3375 B5 S
200 201 0 3 Vande Walle, Mr. Nestor Cyriel male 28.0 0 0 345770 9.5000 NaN S
676 677 0 3 Sawyer, Mr. Frederick Charles male 24.5 0 0 342826 8.0500 NaN S

3.21 The 4 C’s of Data Cleaning: Correcting, Completing, Creating, and Converting

In this stage, we will clean our data by 1) correcting aberrant values and outliers, 2) completing missing information, 3) creating new features for analysis, and 4) converting fields to the correct format for calculations and presentation.

  1. Correcting: Reviewing the data, there does not appear to be any aberrant or non-acceptable data inputs. In addition, we see we may have potential outliers in age and fare. However, since they are reasonable values, we will wait until after we complete our exploratory analysis to determine if we should include or exclude from the dataset. It should be noted, that if they were unreasonable values, for example age = 800 instead of 80, then it’s probably a safe decision to fix now. However, we want to use caution when we modify data from its original value, because it may be necessary to create an accurate model.

  2. Completing: There are null values or missing data in the age, cabin, and embarked field. Missing values can be bad, because some algorithms don’t know how-to handle null values and will fail. While others, like decision trees, can handle null values. Thus, it’s important to fix before we start modeling, because we will compare and contrast several models. There are two common methods, either delete the record or populate the missing value using a reasonable input. It is not recommended to delete the record, especially a large percentage of records, unless it truly represents an incomplete record. Instead, it’s best to impute missing values. A basic methodology for qualitative data is impute using mode. A basic methodology for quantitative data is impute using mean, median, or mean + randomized standard deviation. An intermediate methodology is to use the basic methodology based on specific criteria; like the average age by class or embark port by fare and SES. There are more complex methodologies, however before deploying, it should be compared to the base model to determine if complexity truly adds value. For this dataset, age will be imputed with the median, the cabin attribute will be dropped, and embark will be imputed with mode. Subsequent model iterations may modify this decision to determine if it improves the model’s accuracy.

  3. Creating: Feature engineering is when we use existing features to create new features to determine if they provide new signals to predict our outcome. For this dataset, we will create a title feature to determine if it played a role in survival.

  4. Converting: Last, but certainly not least, we’ll deal with formatting. There are no date or currency formats, but datatype formats. Our categorical data imported as objects, which makes it difficult for mathematical calculations. For this dataset, we will convert object datatypes to categorical dummy variables.

3.21 数据清洗的4个C:修正(Correcting)、补全(Completing)、创建(Creating)和转换(Converting)

在这个阶段,我们将通过以下方式清洗数据:1)修正异常值和离群值,2)补全缺失信息,3)创建新的特征用于分析,4)将字段转换为适合计算和展示的正确格式。

  1. 修正:检查数据后,似乎没有任何异常或不可接受的数据输入。另外,我们看到年龄和票价中可能存在潜在的离群值。但是,由于它们都是合理的取值,我们会等到完成探索性分析之后,再决定是否应将它们保留在数据集中。需要指出的是,如果它们是不合理的值,例如年龄 = 800 而不是 80,那么现在就修正大概是个安全的决定。不过,修改原始数据时要谨慎,因为这些数据可能正是建立准确模型所必需的。
  2. 补全:年龄(Age)、船舱(Cabin)和登船港口(Embarked)字段中存在空值或缺失数据。缺失值可能带来麻烦,因为一些算法不知道如何处理空值,会直接失败;而另一些算法(如决策树)则可以处理空值。由于我们要比较和对比多个模型,所以在开始建模之前把缺失值处理好很重要。常用方法有两种:删除记录,或用合理的输入填充缺失值。不建议删除记录,尤其是大比例地删除,除非它确实是一条不完整的记录。更好的做法是对缺失值进行插补:定性数据的基本方法是用众数插补;定量数据的基本方法是用均值、中位数或均值加随机化的标准差插补。中级方法则是按特定条件应用上述基本方法,比如按船舱等级取平均年龄,或按票价和SES推断登船港口(本列表之后给出了一个示意代码)。还有更复杂的方法,但在部署之前应当与基础模型进行比较,以确定增加的复杂度是否真正带来价值。对于这个数据集,年龄将用中位数插补,船舱属性将被删除,登船港口将用众数插补。后续的模型迭代可以修改这一决定,看是否能提高模型的准确性。
  3. 创建:特征工程是指利用现有特征创建新特征,以确定它们是否能提供新的信号来预测结果。对于这个数据集,我们将创建一个头衔(Title)特征,看看它是否对生存有影响。
  4. 转换:最后但同样重要的是,我们要处理格式问题。这里没有日期或货币格式,但有数据类型格式的问题:导入的分类数据是 object 类型,不便于数学计算。对于这个数据集,我们会把 object 数据类型转换为分类虚拟变量。
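As referenced in item 2 above, here is a sketch of what the "intermediate methodology" could look like: filling missing Age with the class-level Age (the median per Pclass is used here as a robust variant of the class average), rather than the single overall median that the cells below actually use. It is shown for illustration only and is not executed in this walkthrough; it assumes data1, data_val, and data_cleaner exist as created in section 3.2.

#sketch only (not run here): group-based imputation as an alternative to the overall median/mode below
for dataset in data_cleaner:
    #median Age of each Pclass, broadcast back to every row, then used to fill the gaps
    class_median_age = dataset.groupby('Pclass')['Age'].transform('median')
    dataset['Age'] = dataset['Age'].fillna(class_median_age)
    #Embarked has only 2 missing values, so the overall mode remains a reasonable default
    dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace=True)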
print('Train columns with null values:\n', data1.isnull().sum())
print("-"*10)
print('Test/Validation columns with null values:\n', data_val.isnull().sum())
print("-"*10)
data_raw.describe(include = 'all')
Train columns with null values:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
----------
Test/Validation columns with null values:
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
----------

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
count 891.000000 891.000000 891.000000 891 891 714.000000 891.000000 891.000000 891 891.000000 204 889
unique NaN NaN NaN 891 2 NaN NaN NaN 681 NaN 147 3
top NaN NaN NaN Hoyt, Mr. William Fisher male NaN NaN NaN 347082 NaN G6 S
freq NaN NaN NaN 1 577 NaN NaN NaN 7 NaN 4 644
mean 446.000000 0.383838 2.308642 NaN NaN 29.699118 0.523008 0.381594 NaN 32.204208 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN 14.526497 1.102743 0.806057 NaN 49.693429 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 NaN 0.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN 20.125000 0.000000 0.000000 NaN 7.910400 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 NaN 31.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 NaN 512.329200 NaN NaN

3.22 Clean Data

Now that we know what to clean, let’s execute our code.

Developer Documentation:

3.22 数据清洗

现在我们知道了要清洗什么,让我们来执行代码吧。

开发文档:

pandas.DataFrame

pandas.DataFrame.info

pandas.DataFrame.describe

Indexing and Selecting Data

pandas.isnull

pandas.DataFrame.sum

pandas.DataFrame.mode

pandas.DataFrame.copy

pandas.DataFrame.fillna

pandas.DataFrame.drop

pandas.Series.value_counts

pandas.DataFrame.loc

##COMPLETING: complete or delete missing values in train and test/validation dataset
for dataset in data_cleaner:
    #complete missing age with median
    dataset['Age'].fillna(dataset['Age'].median(), inplace = True)
    #complete embarked with mode
    dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace = True)
    #complete missing fare with median
    dataset['Fare'].fillna(dataset['Fare'].median(), inplace = True)

#delete the cabin feature/column and others previously stated to exclude in train dataset
drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)

print(data1.isnull().sum())
print("-"*10)
print(data_val.isnull().sum())
Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
----------
PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
dtype: int64
##CREATE: Feature Engineering for train and test/validation dataset
for dataset in data_cleaner:
    #Discrete variables
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

    dataset['IsAlone'] = 1 #initialize to yes/1 is alone
    dataset['IsAlone'].loc[dataset['FamilySize'] > 1] = 0 # now update to no/0 if family size is greater than 1

    #quick and dirty code split title from name: http://www.pythonforbeginners.com/dictionary/python-split
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

    #Continuous variable bins; qcut vs cut: https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
    #Fare Bins/Buckets using qcut or frequency bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)
    #Age Bins/Buckets using cut or value bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)

#cleanup rare title names
#print(data1['Title'].value_counts())
stat_min = 10 #while small is arbitrary, we'll use the common minimum in statistics: http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
title_names = (data1['Title'].value_counts() < stat_min) #this will create a true false series with title name as index

#apply and lambda functions are quick and dirty code to find and replace with fewer lines of code: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/
data1['Title'] = data1['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)
print(data1['Title'].value_counts())
print("-"*10)

#preview data again
data1.info()
data_val.info()
data1.sample(10)
Mr        517
Miss      182
Mrs       125
Master     40
Misc       27
Name: Title, dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Name          891 non-null object
Sex           891 non-null object
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked      891 non-null object
FamilySize    891 non-null int64
IsAlone       891 non-null int64
Title         891 non-null object
FareBin       891 non-null category
AgeBin        891 non-null category
dtypes: category(2), float64(2), int64(6), object(4)
memory usage: 85.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 16 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
FamilySize     418 non-null int64
IsAlone        418 non-null int64
Title          418 non-null object
FareBin        418 non-null category
AgeBin         418 non-null category
dtypes: category(2), float64(2), int64(6), object(6)
memory usage: 46.8+ KB

Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title FareBin AgeBin
476 0 2 Renouf, Mr. Peter Henry male 34.0 1 0 21.0000 S 2 0 Mr (14.454, 31.0] (32.0, 48.0]
378 0 3 Betros, Mr. Tannous male 20.0 0 0 4.0125 C 1 1 Mr (-0.001, 7.91] (16.0, 32.0]
560 0 3 Morrow, Mr. Thomas Rowan male 28.0 0 0 7.7500 Q 1 1 Mr (-0.001, 7.91] (16.0, 32.0]
338 1 3 Dahl, Mr. Karl Edwart male 45.0 0 0 8.0500 S 1 1 Mr (7.91, 14.454] (32.0, 48.0]
724 1 1 Chambers, Mr. Norman Campbell male 27.0 1 0 53.1000 S 2 0 Mr (31.0, 512.329] (16.0, 32.0]
262 0 1 Taussig, Mr. Emil male 52.0 1 1 79.6500 S 3 0 Mr (31.0, 512.329] (48.0, 64.0]
792 0 3 Sage, Miss. Stella Anna female 28.0 8 2 69.5500 S 11 0 Miss (31.0, 512.329] (16.0, 32.0]
676 0 3 Sawyer, Mr. Frederick Charles male 24.5 0 0 8.0500 S 1 1 Mr (7.91, 14.454] (16.0, 32.0]
556 1 1 Duff Gordon, Lady. (Lucille Christiana Sutherl… female 48.0 1 0 39.6000 C 2 0 Misc (31.0, 512.329] (32.0, 48.0]
565 0 3 Davies, Mr. Alfred J male 24.0 2 0 24.1500 S 3 0 Mr (14.454, 31.0] (16.0, 32.0]

3.23 Convert Formats
We will convert categorical data to dummy variables for mathematical analysis. There are multiple ways to encode categorical variables; we will use the sklearn and pandas functions.

In this step, we will also define our x (independent/features/explanatory/predictor/etc.) and y (dependent/target/outcome/response/etc.) variables for data modeling.

Developer Documentation:

3.23 转换格式

我们将把分类数据转换为虚拟变量以便进行数学分析。编码分类变量有多种方法,我们将使用sklearn和pandas的函数。
在这一步中,我们还将为数据建模定义x(自变量/特征/解释变量/预测变量等)和y(因变量/目标/结果/响应变量等)。

开发者文档:

Categorical Encoding

Sklearn LabelEncoder

Sklearn OneHotEncoder

Pandas Categorical dtype

pandas.get_dummies

#CONVERT: convert objects to category using Label Encoder for train and test/validation dataset
#code categorical data
label = LabelEncoder()
for dataset in data_cleaner:
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])

#define y variable aka target/outcome
Target = ['Survived']

#define x variables for original features aka feature selection
data1_x = ['Sex','Pclass', 'Embarked', 'Title','SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] #pretty name/values for charts
data1_x_calc = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code','SibSp', 'Parch', 'Age', 'Fare'] #coded for algorithm calculation
data1_xy = Target + data1_x
print('Original X Y: ', data1_xy, '\n')

#define x variables for original w/bin features to remove continuous variables
data1_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
data1_xy_bin = Target + data1_x_bin
print('Bin X Y: ', data1_xy_bin, '\n')

#define x and y variables for dummy features original
data1_dummy = pd.get_dummies(data1[data1_x])
data1_x_dummy = data1_dummy.columns.tolist()
data1_xy_dummy = Target + data1_x_dummy
print('Dummy X Y: ', data1_xy_dummy, '\n')

data1_dummy.head()
Original X Y:  ['Survived', 'Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] 

Bin X Y:  ['Survived', 'Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code'] 

Dummy X Y:  ['Survived', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Title_Master', 'Title_Misc', 'Title_Miss', 'Title_Mr', 'Title_Mrs'] 

Pclass SibSp Parch Age Fare FamilySize IsAlone Sex_female Sex_male Embarked_C Embarked_Q Embarked_S Title_Master Title_Misc Title_Miss Title_Mr Title_Mrs
0 3 1 0 22.0 7.2500 2 0 0 1 0 0 1 0 0 0 1 0
1 1 1 0 38.0 71.2833 2 0 1 0 1 0 0 0 0 0 0 1
2 3 0 0 26.0 7.9250 1 1 1 0 0 0 1 0 0 1 0 0
3 1 1 0 35.0 53.1000 2 0 1 0 0 0 1 0 0 0 0 1
4 3 0 0 35.0 8.0500 1 1 0 1 0 0 1 0 0 0 1 0
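As a side note on the "multiple ways to encode categorical variables" mentioned in 3.23, here is a small sketch of pandas' own Categorical dtype (listed in the developer documentation above but not otherwise used in this kernel), which yields integer codes much like LabelEncoder does:

#alternative sketch: pandas Categorical dtype as another way to integer-code a column
embarked_cat = data1['Embarked'].astype('category')
print(embarked_cat.cat.categories)    #the distinct categories, e.g. ['C', 'Q', 'S']
print(embarked_cat.cat.codes.head())  #integer code per row; -1 would mark a missing value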

3.24 Da-Double Check Cleaned Data

Now that we’ve cleaned our data, let’s do a discount da-double check!

3.24 再检查一次清洗后的数据

现在我们有了清洗过的数据了,让我们做个双重检查。

print('Train columns with null values: \n', data1.isnull().sum())
print("-"*10)
print (data1.info())
print("-"*10)
print('Test/Validation columns with null values: \n', data_val.isnull().sum())
print("-"*10)
print (data_val.info())
print("-"*10)
data_raw.describe(include = 'all')
Train columns with null values: 
 Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Fare             0
Embarked         0
FamilySize       0
IsAlone          0
Title            0
FareBin          0
AgeBin           0
Sex_Code         0
Embarked_Code    0
Title_Code       0
AgeBin_Code      0
FareBin_Code     0
dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 19 columns):
Survived         891 non-null int64
Pclass           891 non-null int64
Name             891 non-null object
Sex              891 non-null object
Age              891 non-null float64
SibSp            891 non-null int64
Parch            891 non-null int64
Fare             891 non-null float64
Embarked         891 non-null object
FamilySize       891 non-null int64
IsAlone          891 non-null int64
Title            891 non-null object
FareBin          891 non-null category
AgeBin           891 non-null category
Sex_Code         891 non-null int64
Embarked_Code    891 non-null int64
Title_Code       891 non-null int64
AgeBin_Code      891 non-null int64
FareBin_Code     891 non-null int64
dtypes: category(2), float64(2), int64(11), object(4)
memory usage: 120.3+ KB
None
----------
Test/Validation columns with null values: 
 PassengerId        0
Pclass             0
Name               0
Sex                0
Age                0
SibSp              0
Parch              0
Ticket             0
Fare               0
Cabin            327
Embarked           0
FamilySize         0
IsAlone            0
Title              0
FareBin            0
AgeBin             0
Sex_Code           0
Embarked_Code      0
Title_Code         0
AgeBin_Code        0
FareBin_Code       0
dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 21 columns):
PassengerId      418 non-null int64
Pclass           418 non-null int64
Name             418 non-null object
Sex              418 non-null object
Age              418 non-null float64
SibSp            418 non-null int64
Parch            418 non-null int64
Ticket           418 non-null object
Fare             418 non-null float64
Cabin            91 non-null object
Embarked         418 non-null object
FamilySize       418 non-null int64
IsAlone          418 non-null int64
Title            418 non-null object
FareBin          418 non-null category
AgeBin           418 non-null category
Sex_Code         418 non-null int64
Embarked_Code    418 non-null int64
Title_Code       418 non-null int64
AgeBin_Code      418 non-null int64
FareBin_Code     418 non-null int64
dtypes: category(2), float64(2), int64(11), object(6)
memory usage: 63.1+ KB
None
----------

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
count 891.000000 891.000000 891.000000 891 891 714.000000 891.000000 891.000000 891 891.000000 204 889
unique NaN NaN NaN 891 2 NaN NaN NaN 681 NaN 147 3
top NaN NaN NaN Hoyt, Mr. William Fisher male NaN NaN NaN 347082 NaN G6 S
freq NaN NaN NaN 1 577 NaN NaN NaN 7 NaN 4 644
mean 446.000000 0.383838 2.308642 NaN NaN 29.699118 0.523008 0.381594 NaN 32.204208 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN 14.526497 1.102743 0.806057 NaN 49.693429 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 NaN 0.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN 20.125000 0.000000 0.000000 NaN 7.910400 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 NaN 31.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 NaN 512.329200 NaN NaN

3.25 Split Training and Testing Data

As mentioned previously, the test file provided is really validation data for competition submission. So, we will use sklearn function to split the training data in two datasets; 75/25 split. This is important, so we don’t overfit our model. Meaning, the algorithm is so specific to a given subset, it cannot accurately generalize another subset, from the same dataset. It’s important our algorithm has not seen the subset we will use to test, so it doesn’t “cheat” by memorizing the answers. We will use sklearn’s train_test_split function. In later sections we will also use sklearn’s cross validation functions, that splits our dataset into train and test for data modeling comparison.

3.25分割训练和测试数据

如前所述,所提供的test文件实际上是用于竞赛提交的验证数据。因此,我们将使用sklearn函数把训练数据按75/25的比例拆分成两个数据集。这一点很重要,这样我们才不会让模型过拟合。也就是说,要避免算法对给定子集过于特化,以至于无法准确泛化到同一数据集中的另一个子集。重要的是,算法没有见过我们将用来测试的子集,这样它就不能靠记住答案来"作弊"。我们将使用sklearn的train_test_split函数。在后面的章节中,我们还将使用sklearn的交叉验证函数,它们同样会把数据集拆分为训练集和测试集,用于数据建模的比较。

#split train and test data with function defaults
#random_state -> seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
train1_x, test1_x, train1_y, test1_y = model_selection.train_test_split(data1[data1_x_calc], data1[Target], random_state = 0)
train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(data1[data1_x_bin], data1[Target] , random_state = 0)
train1_x_dummy, test1_x_dummy, train1_y_dummy, test1_y_dummy = model_selection.train_test_split(data1_dummy[data1_x_dummy], data1[Target], random_state = 0)
print("Data1 Shape: {}".format(data1.shape))
print("Train1 Shape: {}".format(train1_x.shape))
print("Test1 Shape: {}".format(test1_x.shape))
train1_x_bin.head()
Data1 Shape: (891, 19)
Train1 Shape: (668, 8)
Test1 Shape: (223, 8)

Sex_Code Pclass Embarked_Code Title_Code FamilySize AgeBin_Code FareBin_Code
105 1 3 2 3 1 1 0
68 0 3 2 2 7 1 1
253 1 3 2 3 2 1 2
320 1 3 2 3 1 1 0
706 0 2 2 4 1 2 1

Step 4: Perform Exploratory Analysis with Statistics

Now that our data is cleaned, we will explore our data with descriptive and graphical statistics to describe and summarize our variables. In this stage, you will find yourself classifying features and determining their correlation with the target variable and each other.

第4步:用统计数据进行探索性分析

现在我们的数据已经清洗完毕,我们将用描述性统计和图形化统计来探索数据,以描述和总结各个变量。在这个阶段,你会发现自己在对特征进行分类,并确定它们与目标变量之间以及彼此之间的相关性。

#Discrete Variable Correlation by Survival using
#group by aka pivot table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
for x in data1_x:
    if data1[x].dtype != 'float64':
        print('Survival Correlation by:', x)
        print(data1[[x, Target[0]]].groupby(x, as_index=False).mean())
        print('-'*10, '\n')

#using crosstabs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html
print(pd.crosstab(data1['Title'], data1[Target[0]]))
Survival Correlation by: Sex
      Sex  Survived
0  female  0.742038
1    male  0.188908
---------- 

Survival Correlation by: Pclass
   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363
---------- 

Survival Correlation by: Embarked
  Embarked  Survived
0        C  0.553571
1        Q  0.389610
2        S  0.339009
---------- 

Survival Correlation by: Title
    Title  Survived
0  Master  0.575000
1    Misc  0.444444
2    Miss  0.697802
3      Mr  0.156673
4     Mrs  0.792000
---------- 

Survival Correlation by: SibSp
   SibSp  Survived
0      0  0.345395
1      1  0.535885
2      2  0.464286
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000
---------- 

Survival Correlation by: Parch
   Parch  Survived
0      0  0.343658
1      1  0.550847
2      2  0.500000
3      3  0.600000
4      4  0.000000
5      5  0.200000
6      6  0.000000
---------- 

Survival Correlation by: FamilySize
   FamilySize  Survived
0           1  0.303538
1           2  0.552795
2           3  0.578431
3           4  0.724138
4           5  0.200000
5           6  0.136364
6           7  0.333333
7           8  0.000000
8          11  0.000000
---------- 

Survival Correlation by: IsAlone
   IsAlone  Survived
0        0  0.505650
1        1  0.303538
---------- 

Survived    0    1
Title             
Master     17   23
Misc       15   12
Miss       55  127
Mr        436   81
Mrs        26   99
#IMPORTANT: Intentionally plotted different ways for learning purposes only.
#optional plotting w/pandas: https://pandas.pydata.org/pandas-docs/stable/visualization.html
#we will use matplotlib.pyplot: https://matplotlib.org/api/pyplot_api.html
#to organize our graphics will use figure: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure
#subplot: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html#matplotlib.pyplot.subplot
#and subplotS: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html?highlight=matplotlib%20pyplot%20subplots#matplotlib.pyplot.subplots
#graph distribution of quantitative data
plt.figure(figsize=[16,12])
plt.subplot(231)
plt.boxplot(x=data1['Fare'], showmeans = True, meanline = True)
plt.title('Fare Boxplot')
plt.ylabel('Fare ($)')
plt.subplot(232)
plt.boxplot(data1['Age'], showmeans = True, meanline = True)
plt.title('Age Boxplot')
plt.ylabel('Age (Years)')
plt.subplot(233)
plt.boxplot(data1['FamilySize'], showmeans = True, meanline = True)
plt.title('Family Size Boxplot')
plt.ylabel('Family Size (#)')
plt.subplot(234)
plt.hist(x = [data1[data1['Survived']==1]['Fare'], data1[data1['Survived']==0]['Fare']],
stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Fare Histogram by Survival')
plt.xlabel('Fare ($)')
plt.ylabel('# of Passengers')
plt.legend()
plt.subplot(235)
plt.hist(x = [data1[data1['Survived']==1]['Age'], data1[data1['Survived']==0]['Age']],
stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()
plt.subplot(236)
plt.hist(x = [data1[data1['Survived']==1]['FamilySize'], data1[data1['Survived']==0]['FamilySize']],
stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Family Size Histogram by Survival')
plt.xlabel('Family Size (#)')
plt.ylabel('# of Passengers')
plt.legend()
<matplotlib.legend.Legend at 0x17ac97d1e10>

#we will use seaborn graphics for multi-variable comparison: https://seaborn.pydata.org/api.html
#graph individual features by survival
fig, saxis = plt.subplots(2, 3,figsize=(16,12))
sns.barplot(x = 'Embarked', y = 'Survived', data=data1, ax = saxis[0,0])
sns.barplot(x = 'Pclass', y = 'Survived', order=[1,2,3], data=data1, ax = saxis[0,1])
sns.barplot(x = 'IsAlone', y = 'Survived', order=[1,0], data=data1, ax = saxis[0,2])
sns.pointplot(x = 'FareBin', y = 'Survived', data=data1, ax = saxis[1,0])
sns.pointplot(x = 'AgeBin', y = 'Survived', data=data1, ax = saxis[1,1])
sns.pointplot(x = 'FamilySize', y = 'Survived', data=data1, ax = saxis[1,2])
<matplotlib.axes._subplots.AxesSubplot at 0x17ac9a5ea20>

#graph distribution of qualitative data: Pclass
#we know class mattered in survival, now let's compare class and a 2nd feature
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(14,12))
sns.boxplot(x = 'Pclass', y = 'Fare', hue = 'Survived', data = data1, ax = axis1)
axis1.set_title('Pclass vs Fare Survival Comparison')
sns.violinplot(x = 'Pclass', y = 'Age', hue = 'Survived', data = data1, split = True, ax = axis2)
axis2.set_title('Pclass vs Age Survival Comparison')
sns.boxplot(x = 'Pclass', y ='FamilySize', hue = 'Survived', data = data1, ax = axis3)
axis3.set_title('Pclass vs Family Size Survival Comparison')
<matplotlib.text.Text at 0x17aca08ab70>

#graph distribution of qualitative data: Sex
#we know sex mattered in survival, now let's compare sex and a 2nd feature
fig, qaxis = plt.subplots(1,3,figsize=(14,12))

sns.barplot(x = 'Sex', y = 'Survived', hue = 'Embarked', data=data1, ax = qaxis[0])
qaxis[0].set_title('Sex vs Embarked Survival Comparison')

sns.barplot(x = 'Sex', y = 'Survived', hue = 'Pclass', data=data1, ax = qaxis[1])
qaxis[1].set_title('Sex vs Pclass Survival Comparison')

sns.barplot(x = 'Sex', y = 'Survived', hue = 'IsAlone', data=data1, ax = qaxis[2])
qaxis[2].set_title('Sex vs IsAlone Survival Comparison')
<matplotlib.text.Text at 0x17ac9c242b0>

#more side-by-side comparisons
fig, (maxis1, maxis2) = plt.subplots(1, 2,figsize=(14,12))
#how does family size factor with sex & survival compare
sns.pointplot(x="FamilySize", y="Survived", hue="Sex", data=data1,
palette={"male": "blue", "female": "pink"},
markers=["*", "o"], linestyles=["-", "--"], ax = maxis1)
#how does class factor with sex & survival compare
sns.pointplot(x="Pclass", y="Survived", hue="Sex", data=data1,
palette={"male": "blue", "female": "pink"},
markers=["*", "o"], linestyles=["-", "--"], ax = maxis2)
<matplotlib.axes._subplots.AxesSubplot at 0x17acacbbdd8>

#how does embark port factor with class, sex, and survival compare
#facetgrid: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
e = sns.FacetGrid(data1, col = 'Embarked')
e.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', ci=95.0, palette = 'deep')
e.add_legend()
<seaborn.axisgrid.FacetGrid at 0x17ac91fe710>

#plot distributions of age of passengers who survived or did not survive
a = sns.FacetGrid( data1, hue = 'Survived', aspect=4 )
a.map(sns.kdeplot, 'Age', shade= True )
a.set(xlim=(0 , data1['Age'].max()))
a.add_legend()
<seaborn.axisgrid.FacetGrid at 0x17acaf6ed68>

#histogram comparison of sex, class, and age by survival
h = sns.FacetGrid(data1, row = 'Sex', col = 'Pclass', hue = 'Survived')
h.map(plt.hist, 'Age', alpha = .75)
h.add_legend()
<seaborn.axisgrid.FacetGrid at 0x17acb07deb8>

#pair plots of entire dataset
pp = sns.pairplot(data1, hue = 'Survived', palette = 'deep', size=1.2, diag_kind = 'kde', diag_kws=dict(shade=True), plot_kws=dict(s=10) )
pp.set(xticklabels=[])
<seaborn.axisgrid.PairGrid at 0x17acc56fd30>

#correlation heatmap of dataset
def correlation_heatmap(df):
    _, ax = plt.subplots(figsize = (14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)

    _ = sns.heatmap(
        df.corr(),
        cmap = colormap,
        square=True,
        cbar_kws={'shrink':.9 },
        ax=ax,
        annot=True,
        linewidths=0.1, vmax=1.0, linecolor='white',
        annot_kws={'fontsize':12 }
    )

    plt.title('Pearson Correlation of Features', y=1.05, size=15)

correlation_heatmap(data1)

Step 5: Model Data

Data Science is a multi-disciplinary field between mathematics (i.e. statistics, linear algebra, etc.), computer science (i.e. programming languages, computer systems, etc.) and business management (i.e. communication, subject-matter knowledge, etc.). Most data scientists come from one of the three fields, so they tend to lean towards that discipline. However, data science is like a three-legged stool, with no one leg being more important than the other. So, this step will require advanced knowledge in mathematics. But don’t worry, we only need a high-level overview, which we’ll cover in this Kernel. Also, thanks to computer science, a lot of the heavy lifting is done for you. So, problems that once required graduate degrees in mathematics or statistics, now only take a few lines of code. Last, we’ll need some business acumen to think through the problem. After all, like training a seeing-eye dog, it’s learning from us and not the other way around.

Machine Learning (ML), as the name suggest, is teaching the machine how-to think and not what to think. While this topic and big data has been around for decades, it is becoming more popular than ever because the barrier to entry is lower, for businesses and professionals alike. This is both good and bad. It’s good because these algorithms are now accessible to more people that can solve more problems in the real-world. It’s bad because a lower barrier to entry means, more people will not know the tools they are using and can come to incorrect conclusions. That’s why I focus on teaching you, not just what to do, but why you’re doing it. Previously, I used the analogy of asking someone to hand you a Philip screwdriver, and they hand you a flathead screwdriver or worst a hammer. At best, it shows a complete lack of understanding. At worst, it makes completing the project impossible; or even worst, implements incorrect actionable intelligence. So now that I’ve hammered (no pun intended) my point, I’ll show you what to do and most importantly, WHY you do it.

First, you must understand, that the purpose of machine learning is to solve human problems. Machine learning can be categorized as: supervised learning, unsupervised learning, and reinforced learning. Supervised learning is where you train the model by presenting it a training dataset that includes the correct answer. Unsupervised learning is where you train the model using a training dataset that does not include the correct answer. And reinforced learning is a hybrid of the previous two, where the model is not given the correct answer immediately, but later after a sequence of events to reinforce learning. We are doing supervised machine learning, because we are training our algorithm by presenting it with a set of features and their corresponding target. We then hope to present it a new subset from the same dataset and have similar results in prediction accuracy.

There are many machine learning algorithms, however they can be reduced to four categories: classification, regression, clustering, or dimensionality reduction, depending on your target variable and data modeling goals. We’ll save clustering and dimension reduction for another day, and focus on classification and regression. We can generalize that a continuous target variable requires a regression algorithm and a discrete target variable requires a classification algorithm. One side note, logistic regression, while it has regression in the name, is really a classification algorithm. Since our problem is predicting if a passenger survived or did not survive, this is a discrete target variable. We will use a classification algorithm from the sklearn library to begin our analysis. We will use cross validation and scoring metrics, discussed in later sections, to rank and compare our algorithms’ performance.

第五步:建模数据

数据科学是介于数学(即统计学、线性代数等)、计算机科学(即编程语言、计算机系统等)和商业管理(即沟通、领域知识等)之间的多学科领域。大多数数据科学家来自这三个领域之一,因此往往偏向自己出身的学科。然而,数据科学就像三脚凳,没有哪条腿比另一条更重要。所以,这一步需要较深的数学知识。但是不要担心,我们只需要一个高层次的概述,这在本Kernel中就会涉及。此外,多亏了计算机科学,很多繁重的工作都已经替你完成了。所以,曾经需要数学或统计学研究生学位才能解决的问题,现在只需要几行代码。最后,我们还需要一些商业头脑来思考这个问题。毕竟,就像训练一只导盲犬一样,是它在向我们学习,而不是反过来。
机器学习(ML),顾名思义,是教机器如何思考,而不是思考什么。虽然这个话题和大数据已经存在了数十年,但它现在比以往任何时候都更受欢迎,因为无论对企业还是专业人士,进入门槛都降低了。这有好有坏:好处在于,这些算法现在可以被更多的人使用,从而在现实世界中解决更多的问题;坏处在于,更低的进入门槛意味着更多的人并不了解自己正在使用的工具,可能得出不正确的结论。这就是为什么我专注于教你的不仅是做什么,还有为什么这么做。之前我用过一个类比:你让别人递给你一把十字(Philips)螺丝刀,他们却递给你一把一字螺丝刀,甚至更糟,递给你一把锤子。往好了说,这表明对方完全没有理解;往坏了说,这会让项目无法完成;更糟的是,还可能落实了错误的可操作情报。所以,既然我已经把观点"敲"清楚了(并非有意双关),接下来我会告诉你该做什么,以及最重要的——为什么这么做。
首先,你必须明白,机器学习的目的是解决人类的问题。机器学习可以分为:监督学习、无监督学习和强化学习。监督学习是指用包含正确答案的训练数据集来训练模型;无监督学习是指用不包含正确答案的训练数据集来训练模型;强化学习则是前两者的混合体——模型不会立即得到正确答案,而是在一系列事件之后才得到反馈,以强化学习效果。我们做的是监督机器学习,因为我们通过向算法提供一组特征及其对应的目标来训练它。然后我们希望向它提供来自同一数据集的新子集,并在预测准确度上获得相似的结果。
机器学习算法有很多,但根据目标变量和数据建模目标,它们可以归为四类:分类、回归、聚类和降维。聚类和降维留待以后再讨论,这里我们专注于分类和回归。可以概括地说,连续的目标变量需要回归算法,离散的目标变量需要分类算法。顺便一提,logistic回归虽然名字里有"回归",但实际上是一种分类算法。由于我们的问题是预测乘客是否幸存,这是一个离散的目标变量,我们将使用sklearn库中的分类算法来开始分析,并使用交叉验证和评分指标(在后面的章节中讨论)来对算法性能进行排名和比较。

Machine Learning Selection:

Now that we have identified our solution as a supervised learning classification algorithm, we can narrow our list of choices.


Machine Learning Classification Algorithms:


Data Science 101: How to Choose a Machine Learning Algorithm (MLA)

IMPORTANT: When it comes to data modeling, the beginner's question is always, "what is the best machine learning algorithm?" To this, the beginner must learn the No Free Lunch Theorem (NFLT) of machine learning. In short, the NFLT states that there is no super algorithm that works best in all situations, for all datasets. So the best approach is to try multiple MLAs, tune them, and compare them for your specific scenario. With that being said, some good research has been done to compare algorithms, such as Caruana & Niculescu-Mizil 2006 (an MLA comparison with an accompanying video lecture), Ogutu et al. 2011 (done by the NIH for genomic selection), Fernandez-Delgado et al. 2014 (comparing 179 classifiers from 17 families), and Thoma 2016 (an sklearn comparison); there is also a school of thought that says more data beats a better algorithm.

So with all this information, where is a beginner to start? I recommend starting with Trees, Bagging, Random Forests, and Boosting. They are basically different implementations of a decision tree, which is the easiest concept to learn and understand. They are also easier to tune (discussed in the next section) than something like SVC. Below, I'll give an overview of how-to run and compare several MLAs, but the rest of this Kernel will focus on learning data modeling via decision trees and their derivatives.


#Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),

    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),

    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),

    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),

    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),

    #Trees
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),

    #Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),

    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    XGBClassifier()
    ]

#split dataset in cross-validation with this splitter class: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit
#note: this is an alternative to train_test_split
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0) #run model 10x with a 60/30 split, intentionally leaving out 10%

#create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters','MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD' ,'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)

#create table to compare MLA predictions
MLA_predict = data1[Target]

#index through MLA and save performance to table
row_index = 0
for alg in MLA:

    #set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())

    #score model with cross validation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate
    cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv = cv_split)

    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()
    #if this is a non-biased random sample, then +/-3 standard deviations (std) from the mean should statistically capture 99.7% of the subsets
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3 #let's know the worst that can happen!

    #save MLA predictions - see section 6 for usage
    alg.fit(data1[data1_x_bin], data1[Target])
    MLA_predict[MLA_name] = alg.predict(data1[data1_x_bin])

    row_index += 1

#print and sort table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)
MLA_compare
#MLA_predict

MLA Name MLA Parameters MLA Train Accuracy Mean MLA Test Accuracy Mean MLA Test Accuracy 3*STD MLA Time
21 XGBClassifier {‘base_score’: 0.5, ‘booster’: ‘gbtree’, ‘cols… 0.856367 0.829478 0.0527546 0.0243111
14 SVC {‘C’: 1.0, ‘cache_size’: 200, ‘class_weight’: … 0.837266 0.826119 0.0453876 0.0363276
3 GradientBoostingClassifier {‘criterion’: ‘friedman_mse’, ‘init’: None, ‘l… 0.866667 0.822761 0.0498731 0.0580237
15 NuSVC {‘cache_size’: 200, ‘class_weight’: None, ‘coe… 0.835768 0.822761 0.0493681 0.0380909
2 ExtraTreesClassifier {‘bootstrap’: False, ‘class_weight’: None, ‘cr… 0.895131 0.821642 0.0642658 0.0158152
17 DecisionTreeClassifier {‘class_weight’: None, ‘criterion’: ‘gini’, ‘m… 0.895131 0.821642 0.0572539 0.00240188
4 RandomForestClassifier {‘bootstrap’: True, ‘class_weight’: None, ‘cri… 0.89176 0.821269 0.0599586 0.0179999
1 BaggingClassifier {‘base_estimator’: None, ‘bootstrap’: True, ‘b… 0.891199 0.819403 0.0656928 0.0202108
13 KNeighborsClassifier {‘algorithm’: ‘auto’, ‘leaf_size’: 30, ‘metric… 0.850375 0.813806 0.0690863 0.00369971
0 AdaBoostClassifier {‘algorithm’: ‘SAMME.R’, ‘base_estimator’: Non… 0.820412 0.81194 0.0498606 0.0667009
5 GaussianProcessClassifier {‘copy_X_train’: True, ‘kernel’: None, ‘max_it… 0.871723 0.810448 0.0492537 0.204626
18 ExtraTreeClassifier {‘class_weight’: None, ‘criterion’: ‘gini’, ‘m… 0.895131 0.808955 0.0725111 0.00140331
20 QuadraticDiscriminantAnalysis {‘priors’: None, ‘reg_param’: 0.0, ‘store_cova… 0.821536 0.80709 0.0810389 0.00195031
8 RidgeClassifierCV {‘alphas’: (0.1, 1.0, 10.0), ‘class_weight’: N… 0.796629 0.79403 0.0360302 0.00429864
19 LinearDiscriminantAnalysis {‘n_components’: None, ‘priors’: None, ‘shrink… 0.796816 0.79403 0.0360302 0.00365217
16 LinearSVC {‘C’: 1.0, ‘class_weight’: None, ‘dual’: True,… 0.797378 0.79291 0.0410533 0.0267416
6 LogisticRegressionCV {‘Cs’: 10, ‘class_weight’: None, ‘cv’: None, ‘… 0.797004 0.790672 0.0653582 0.107427
12 GaussianNB {‘priors’: None} 0.794757 0.781343 0.0874568 0.00235162
11 BernoulliNB {‘alpha’: 1.0, ‘binarize’: 0.0, ‘class_prior’:… 0.785768 0.775373 0.0570347 0.00343521
10 Perceptron {‘alpha’: 0.0001, ‘class_weight’: None, ‘eta0’… 0.740075 0.728731 0.162221 0.00332923
9 SGDClassifier {‘alpha’: 0.0001, ‘average’: False, ‘class_wei… 0.714045 0.694776 0.245136 0.0019011
7 PassiveAggressiveClassifier {‘C’: 1.0, ‘average’: False, ‘class_weight’: N… 0.687079 0.673134 0.455343 0.00274937

#barplot using https://seaborn.pydata.org/generated/seaborn.barplot.html
sns.barplot(x='MLA Test Accuracy Mean', y = 'MLA Name', data = MLA_compare, color = 'm')
#prettify using pyplot: https://matplotlib.org/api/pyplot_api.html
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score (%)')
plt.ylabel('Algorithm')
[Figure: Machine Learning Algorithm Accuracy Score, a horizontal bar plot of MLA Test Accuracy Mean by algorithm]

5.1 Evaluate Model Performance

Let's recap: with some basic data cleaning, analysis, and machine learning algorithms (MLAs), we are able to predict passenger survival with ~82% accuracy. Not bad for a few lines of code. But the question we always ask is, can we do better and, more importantly, get an ROI (return on investment) for our time invested? For example, if we're only going to increase our accuracy by 1/10th of a percent, is it really worth 3 months of development? If you work in research, maybe the answer is yes, but if you work in business, mostly the answer is no. So, keep that in mind when improving your model.


Data Science 101: Determine a Baseline Accuracy

Before we decide how-to make our model better, let's determine if our model is even worth keeping. To do that, we have to go back to the basics of Data Science 101. We know this is a binary problem, because there are only two possible outcomes; passengers survived or died. So, think of it like a coin flip problem. If you have a fair coin and you guess heads or tails, then you have a 50-50 chance of guessing correctly. So, let's set 50% as the worst model performance; because anything lower than that, why do I need you when I can just flip a coin?

Okay, so with no information about the dataset, we can always get 50% with a binary problem. But we have information about the dataset, so we should be able to do better. We know that 1,502/2,224, or 67.5%, of people died. Therefore, if we just predict the most frequent occurrence, that 100% of people died, then we would be right 67.5% of the time. So, let's set 68% as bad model performance, because, again, anything lower than that and why do I need you, when I can just predict using the most frequent occurrence?
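
As a quick sanity check, this baseline can be computed in a couple of lines. The sketch below assumes the data1 DataFrame and data1_x_bin feature list from the earlier sections; sklearn's DummyClassifier is simply one convenient way to formalize "always predict the most frequent class."

from sklearn.dummy import DummyClassifier
from sklearn import metrics

#null accuracy: always predict the most frequent class (died/0)
baseline = DummyClassifier(strategy = 'most_frequent')
baseline.fit(data1[data1_x_bin], data1['Survived'])
baseline_acc = metrics.accuracy_score(data1['Survived'], baseline.predict(data1[data1_x_bin]))
print('Most Frequent Class Baseline Accuracy: {:.2f}%'.format(baseline_acc*100))
#note: in the 891-row train sample this lands near 62%, vs the 67.5% quoted for the full population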


Data Science 101: How-to Create Your Own Model

Our accuracy is increasing, but can we do better? Are there any signals in our data? To illustrate this, we're going to build our own decision tree model, because it is the easiest to conceptualize and requires only simple addition and multiplication calculations. When creating a decision tree, you want to ask questions that segment your target response, placing the survived/1 and dead/0 passengers into homogeneous subgroups. This is part science and part art, so let's just play the 21-questions game to show you how it works. If you want to follow along on your own, download the train dataset and import it into Excel. Create a pivot table with survival in the columns, count and % of row count in the values, and the features described below in the rows.

Remember, the name of the game is to create subgroups using a decision tree model to get survived/1 in one bucket and dead/0 in another bucket. Our rule of thumb will be majority rules: if more than 50% of a subgroup survived, then everybody in that subgroup is predicted survived/1, and if more than 50% died, then everybody in that subgroup is predicted dead/0. Also, we will stop if the subgroup is smaller than 10 and/or our model accuracy plateaus or decreases. Got it? Let's go!
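
If you would rather stay in Python than Excel, roughly the same pivot can be built with pandas. This is only a sketch against the data1 DataFrame from the earlier sections; which columns you group on is up to you.

#pandas approximation of the Excel pivot table described above
pivot = pd.pivot_table(data1, index = ['Sex', 'Pclass', 'Embarked'], values = 'Survived',
                       aggfunc = ['count', 'mean']) #passenger count and survival rate per subgroup
print(pivot)
#majority rule: subgroups with a survival rate above .5 get predicted survived/1, the rest dead/0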


Question 1: Were you on the Titanic? If yes, then the majority (62%) died. Note that our sample's death rate (62%) is different from the population's 68%. Nonetheless, if we assume everybody died, our sample accuracy is 62%.


Question 2: Are you male or female? For males, the majority (81%) died. For females, the majority (74%) survived. This gives us an accuracy of 79%.
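
To see roughly where the 79% comes from, weight each branch's accuracy by its share of passengers; the counts below (577 males, 314 females, 891 total) come from the train set, and the rounded percentages make the result approximate.

#rough check of the ~79% figure quoted above
male_correct   = 0.81 * 577 #predict died for males; ~81% of the 577 males actually died
female_correct = 0.74 * 314 #predict survived for females; ~74% of the 314 females actually survived
print('Approximate accuracy: {:.1f}%'.format((male_correct + female_correct)/891*100)) #roughly 78-79%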


Question 3A (going down the female branch with count = 314): Are you in class 1, 2, or 3? In class 1, the majority (97%) survived, and in class 2, the majority (92%) survived. Since the dead subgroups are smaller than 10, we will stop going down those branches. Class 3 is roughly even at a 50-50 split, so no new information to improve our model is gained.


Question 4A (going down the female, class 3 branch with count = 144): Did you embark from port C, Q, or S? We gain a little information. For C and Q, the majority still survived, so no change; also, the dead subgroups are smaller than 10, so we will stop there. For S, the majority (63%) died. So, we will change females in class 3 who embarked at S from predicted survived to predicted died. Our model accuracy increases to 81%.


Question 5A (going down the female, class 3, embarked S branch with count = 88): So far, it looks like we made good decisions. Adding another level does not seem to gain much more information. In this subgroup, 55 died and 33 survived; since the majority died, we need to find a signal that identifies those 33, or a sub-subgroup we can flip from dead to survived, to improve our model accuracy. We can play with our features. One I found was fare 0-8, where the majority survived. It's a small sample size (11 vs 9), but one often used in statistics. We slightly improve our accuracy, but not enough to move us past 82%. So, we'll stop here.


Question 3B (going down the male branch with count = 577): Going back to question 2, we know the majority of males died. So, we are looking for a feature that identifies a subgroup where the majority survived. Surprisingly, class or even embarkation port didn't matter like it did for females, but title does, and it gets us to 82%. Guessing and checking other features, none seem to push us past 82%. So, we'll stop here for now.


You did it, with very little information, we get to 82% accuracy. On a worst, bad, good, better, and best scale, we’ll set 82% to good, since it’s a simple model that yields us decent results. But the question still remains, can we do better than our handmade model?

Before we do, let’s code what we just wrote above. Please note, this is a manual process created by “hand.” You won’t have to do this, but it’s important to understand it before you start working with MLA. Think of MLA like a TI-89 calculator on a Calculus Exam. It’s very powerful and helps you with a lot of the grunt work. But if you don’t know what you’re doing on the exam, a calculator, even a TI-89, is not going to help you pass. So, study the next section wisely.

Reference: Cross-Validation and Decision Tree Tutorial


#IMPORTANT: This is a handmade model for learning purposes only.
#However, it is possible to create your own predictive model without a fancy algorithm :)

#coin flip model with random 1/survived 0/died

#iterate over dataFrame rows as (index, Series) pairs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html
for index, row in data1.iterrows():
    #random number generator: https://docs.python.org/2/library/random.html
    if random.random() > .5:     # Random float x, 0.0 <= x < 1.0
        data1.set_value(index, 'Random_Predict', 1) #predict survived/1
    else:
        data1.set_value(index, 'Random_Predict', 0) #predict died/0

#score random guess of survival. Use shortcut 1 = Right Guess and 0 = Wrong Guess
#the mean of the column will then equal the accuracy
data1['Random_Score'] = 0 #assume prediction wrong
data1.loc[(data1['Survived'] == data1['Random_Predict']), 'Random_Score'] = 1 #set to 1 for correct prediction
print('Coin Flip Model Accuracy: {:.2f}%'.format(data1['Random_Score'].mean()*100))

#we can also use scikit's accuracy_score function to save us a few lines of code
#http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
print('Coin Flip Model Accuracy w/SciKit: {:.2f}%'.format(metrics.accuracy_score(data1['Survived'], data1['Random_Predict'])*100))
Coin Flip Model Accuracy: 50.06%
Coin Flip Model Accuracy w/SciKit: 50.06%
#group by or pivot table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
pivot_female = data1[data1.Sex=='female'].groupby(['Sex','Pclass', 'Embarked','FareBin'])['Survived'].mean()
print('Survival Decision Tree w/Female Node: \n',pivot_female)
pivot_male = data1[data1.Sex=='male'].groupby(['Sex','Title'])['Survived'].mean()
print('\n\nSurvival Decision Tree w/Male Node: \n',pivot_male)
Survival Decision Tree w/Female Node: 
 Sex     Pclass  Embarked  FareBin        
female  1       C         (14.454, 31.0]     0.666667
                          (31.0, 512.329]    1.000000
                Q         (31.0, 512.329]    1.000000
                S         (14.454, 31.0]     1.000000
                          (31.0, 512.329]    0.955556
        2       C         (7.91, 14.454]     1.000000
                          (14.454, 31.0]     1.000000
                          (31.0, 512.329]    1.000000
                Q         (7.91, 14.454]     1.000000
                S         (7.91, 14.454]     0.875000
                          (14.454, 31.0]     0.916667
                          (31.0, 512.329]    1.000000
        3       C         (-0.001, 7.91]     1.000000
                          (7.91, 14.454]     0.428571
                          (14.454, 31.0]     0.666667
                Q         (-0.001, 7.91]     0.750000
                          (7.91, 14.454]     0.500000
                          (14.454, 31.0]     0.714286
                S         (-0.001, 7.91]     0.533333
                          (7.91, 14.454]     0.448276
                          (14.454, 31.0]     0.357143
                          (31.0, 512.329]    0.125000
Name: Survived, dtype: float64


Survival Decision Tree w/Male Node: 
 Sex   Title 
male  Master    0.575000
      Misc      0.250000
      Mr        0.156673
Name: Survived, dtype: float64

#handmade data model using brain power (and Microsoft Excel Pivot Tables for quick calculations)
def mytree(df):

    #initialize table to store predictions
    Model = pd.DataFrame(data = {'Predict':[]})
    male_title = ['Master'] #survived titles

    for index, row in df.iterrows():

        #Question 1: Were you on the Titanic; majority died
        Model.loc[index, 'Predict'] = 0

        #Question 2: Are you female; majority survived
        if (df.loc[index, 'Sex'] == 'female'):
            Model.loc[index, 'Predict'] = 1

        #Question 3A Female - Class and Question 4 Embarked gain minimum information

        #Question 5B Female - FareBin; set anything less than .5 in the female node decision tree back to 0
        if ((df.loc[index, 'Sex'] == 'female') &
            (df.loc[index, 'Pclass'] == 3) &
            (df.loc[index, 'Embarked'] == 'S') &
            (df.loc[index, 'Fare'] > 8)
           ):
            Model.loc[index, 'Predict'] = 0

        #Question 3B Male: Title; set anything greater than .5 to 1 for majority survived
        if ((df.loc[index, 'Sex'] == 'male') &
            (df.loc[index, 'Title'] in male_title)
           ):
            Model.loc[index, 'Predict'] = 1

    return Model

#model data
Tree_Predict = mytree(data1)
print('Decision Tree Model Accuracy/Precision Score: {:.2f}%\n'.format(metrics.accuracy_score(data1['Survived'], Tree_Predict)*100))

#Accuracy Summary Report with http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
#Where recall score = (true positives)/(true positives + false negatives) w/1 being best: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
#And F1 score = weighted average of precision and recall w/1 being best: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
print(metrics.classification_report(data1['Survived'], Tree_Predict))
Decision Tree Model Accuracy/Precision Score: 82.04%

             precision    recall  f1-score   support

          0       0.82      0.91      0.86       549
          1       0.82      0.68      0.75       342

avg / total       0.82      0.82      0.82       891

#Plot Accuracy Summary
#Credit: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = metrics.confusion_matrix(data1['Survived'], Tree_Predict)
np.set_printoptions(precision=2)

class_names = ['Dead', 'Survived']

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
Confusion matrix, without normalization
[[497  52]
 [108 234]]
Normalized confusion matrix
[[ 0.91  0.09]
 [ 0.32  0.68]]

5.11 Model Performance with Cross-Validation (CV)

In step 5.0, we used the sklearn cross_validate function to train, test, and score our model performance.

Remember, it’s important we use a different subset for train data to build our model and test data to evaluate our model. Otherwise, our model will be overfitted. Meaning it’s great at “predicting” data it’s already seen, but terrible at predicting data it has not seen; which is not prediction at all. It’s like cheating on a school quiz to get 100%, but then when you go to take the exam, you fail because you never truly learned anything. The same is true with machine learning.

CV is basically a shortcut to split and score our model multiple times, so we can get an idea of how well it will perform on unseen data. It’s a little more expensive in computer processing, but it’s important so we don’t gain false confidence. This is helpful in a Kaggle Competition or any use case where consistency matters and surprises should be avoided.

In addition to CV, we used a customized sklearn train-test splitter (ShuffleSplit) to allow a little more randomness in our test scoring; a sketch of the default CV split it produces is shown below.
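
Here is a minimal sketch of what that splitter generates; the ten-element array only stands in for the real feature matrix, and the printed indices are purely illustrative.

#each of the 10 ShuffleSplit iterations draws a fresh random 60% train / 30% test split (10% left out)
import numpy as np
from sklearn import model_selection

toy_X = np.arange(10) #stand-in for the real feature matrix
demo_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)
for i, (train_idx, test_idx) in enumerate(demo_split.split(toy_X)):
    print('split {}: train={} test={}'.format(i, train_idx, test_idx))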


5.12 Tune Model with Hyper-Parameters

When we used the sklearn Decision Tree (DT) Classifier, we accepted all the function defaults. This leaves an opportunity to see how various hyper-parameter settings will change the model accuracy. (Click here to learn more about parameters vs hyper-parameters.)

However, in order to tune a model, we need to actually understand it. That’s why I took the time in the previous sections to show you how predictions work. Now let’s learn a little bit more about our DT algorithm.
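
As a small aside (this sketch is not part of the original notebook), the hyper-parameters are what you hand the estimator before fitting and can inspect or change with get_params()/set_params(), while the fitted "parameters" (here, the learned tree splits) only exist after fit() is called.

from sklearn import tree

demo_dt = tree.DecisionTreeClassifier()                  #all hyper-parameters left at their defaults
print(demo_dt.get_params())                              #e.g. criterion='gini', max_depth=None, ...
demo_dt.set_params(criterion = 'entropy', max_depth = 4) #change hyper-parameters before fitting
#the learned parameters (the tree splits themselves) are only created once .fit() is called on data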


Credit: sklearn

Some advantages of decision trees are:

  • Simple to understand and to interpret. Trees can be visualized.
  • Requires little data preparation. Other techniques often require data normalization, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.
  • The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
  • Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See algorithms for more information.
  • Able to handle multi-output problems.
  • Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by Boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.
  • Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
  • Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

The disadvantages of decision trees include:

  • Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
  • The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
  • There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
  • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
Below are the available hyper-parameters and definitions:


(Translator's note: below is the sklearn DecisionTreeClassifier signature.)
class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
We will tune our model using ParameterGrid, GridSearchCV, and customized sklearn scoring; click here to learn more about ROC_AUC scores. We will then visualize our tree with graphviz.

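
Since the grid searches below score with 'roc_auc' instead of plain accuracy, here is a minimal, self-contained sketch of how that score is computed from predicted probabilities; the labels and probabilities are made up purely for illustration.

from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]          #actual died/survived
y_proba = [.1, .4, .35, .8, .7, .2, .9, .6] #model's predicted probability of survival
print('ROC AUC: {:.3f}'.format(roc_auc_score(y_true, y_proba)))
#1.0 = survivors always ranked above non-survivors, 0.5 = no better than random guessing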

#base model
dtree = tree.DecisionTreeClassifier(random_state = 0)
base_results = model_selection.cross_validate(dtree, data1[data1_x_bin], data1[Target], cv = cv_split)
dtree.fit(data1[data1_x_bin], data1[Target])
print('BEFORE DT Parameters: ', dtree.get_params())
print("BEFORE DT Training w/bin score mean: {:.2f}". format(base_results['train_score'].mean()*100))
print("BEFORE DT Test w/bin score mean: {:.2f}". format(base_results['test_score'].mean()*100))
print("BEFORE DT Test w/bin score 3*std: +/- {:.2f}". format(base_results['test_score'].std()*100*3))
#print("BEFORE DT Test w/bin set score min: {:.2f}". format(base_results['test_score'].min()*100))
print('-'*10)
#tune hyper-parameters: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
param_grid = {'criterion': ['gini', 'entropy'],  #scoring methodology; two supported formulas for calculating information gain - default is gini
              #'splitter': ['best', 'random'], #splitting methodology; two supported strategies - default is best
              'max_depth': [2,4,6,8,10,None],  #max depth tree can grow; default is none
              #'min_samples_split': [2,5,10,.03,.05], #minimum subset size BEFORE new split (fraction is % of total); default is 2
              #'min_samples_leaf': [1,5,10,.03,.05], #minimum subset size AFTER new split (fraction is % of total); default is 1
              #'max_features': [None, 'auto'], #max features to consider when performing split; default none or all
              'random_state': [0] #seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
             }
#print(list(model_selection.ParameterGrid(param_grid)))
#choose best model with grid_search: #http://scikit-learn.org/stable/modules/grid_search.html#grid-search
#http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split)
tune_model.fit(data1[data1_x_bin], data1[Target])
#print(tune_model.cv_results_.keys())
#print(tune_model.cv_results_['params'])
print('AFTER DT Parameters: ', tune_model.best_params_)
#print(tune_model.cv_results_['mean_train_score'])
print("AFTER DT Training w/bin score mean: {:.2f}". format(tune_model.cv_results_['mean_train_score'][tune_model.best_index_]*100))
#print(tune_model.cv_results_['mean_test_score'])
print("AFTER DT Test w/bin score mean: {:.2f}". format(tune_model.cv_results_['mean_test_score'][tune_model.best_index_]*100))
print("AFTER DT Test w/bin score 3*std: +/- {:.2f}". format(tune_model.cv_results_['std_test_score'][tune_model.best_index_]*100*3))
print('-'*10)
#duplicates gridsearchcv
#tune_results = model_selection.cross_validate(tune_model, data1[data1_x_bin], data1[Target], cv = cv_split)
#print('AFTER DT Parameters: ', tune_model.best_params_)
#print("AFTER DT Training w/bin set score mean: {:.2f}". format(tune_results['train_score'].mean()*100))
#print("AFTER DT Test w/bin set score mean: {:.2f}". format(tune_results['test_score'].mean()*100))
#print("AFTER DT Test w/bin set score min: {:.2f}". format(tune_results['test_score'].min()*100))
#print('-'*10)
BEFORE DT Parameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': False, 'random_state': 0, 'splitter': 'best'}
BEFORE DT Training w/bin score mean: 89.51
BEFORE DT Test w/bin score mean: 82.09
BEFORE DT Test w/bin score 3*std: +/- 5.57
----------
AFTER DT Parameters:  {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
AFTER DT Training w/bin score mean: 89.35
AFTER DT Test w/bin score mean: 87.40
AFTER DT Test w/bin score 3*std: +/- 5.00
----------

#Graph MLA version of Decision Tree: http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
# if a FileNotFoundError is raised, please run [ conda install graphviz ] from the command line
# question link: https://stackoverflow.com/questions/28312534/graphvizs-executables-are-not-found-python-3-4
# how to install graphviz:
# 1. pip install graphviz
# 2. download: https://graphviz.gitlab.io/_pages/Download/windows/graphviz-2.38.msi
# 3. install it and remember your install path (like: C:\Program Files (x86)\Graphviz2.38\bin)
# 4. set the Path: to modify PATH go to Control Panel > System and Security > System > Advanced System Settings > Environment Variables > Path > Edit > New
# 5. restart your computer
# (translator's note: the steps above are left in English; they should be easy to follow)
import graphviz
dot_data = tree.export_graphviz(dtree, out_file=None,
                                feature_names = data1_x_bin, class_names = True,
                                filled = True, rounded = True)
graph = graphviz.Source(dot_data)
# to save the rendered tree to a file, uncomment:
#graph.render('treemodel.gv')
graph


[Figure: the fitted decision tree rendered with graphviz]

Step 6: Validate and Implement

The next step is to prepare our submission using the validation data.


#compare algorithm predictions with each other, where 1 = exactly similar and 0 = exactly opposite
#there are some 1's, but enough blues and light reds to create a "super algorithm" by combining them
correlation_heatmap(MLA_predict)

#why choose one model, when you can pick them all with voting classifier
#http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
#removed models w/o attribute 'predict_proba' required for vote classifier and models with a 1.0 correlation to another model
vote_est = [
    #Ensemble Methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('ada', ensemble.AdaBoostClassifier()),
    ('bc', ensemble.BaggingClassifier()),
    ('etc', ensemble.ExtraTreesClassifier()),
    ('gbc', ensemble.GradientBoostingClassifier()),
    ('rfc', ensemble.RandomForestClassifier()),

    #Gaussian Processes: http://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-classification-gpc
    ('gpc', gaussian_process.GaussianProcessClassifier()),

    #GLM: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
    ('lr', linear_model.LogisticRegressionCV()),

    #Naive Bayes: http://scikit-learn.org/stable/modules/naive_bayes.html
    ('bnb', naive_bayes.BernoulliNB()),
    ('gnb', naive_bayes.GaussianNB()),

    #Nearest Neighbor: http://scikit-learn.org/stable/modules/neighbors.html
    ('knn', neighbors.KNeighborsClassifier()),

    #SVM: http://scikit-learn.org/stable/modules/svm.html
    ('svc', svm.SVC(probability=True)),

    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    ('xgb', XGBClassifier())
    ]
#Hard Vote or majority rules
vote_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
vote_hard_cv = model_selection.cross_validate(vote_hard, data1[data1_x_bin], data1[Target], cv = cv_split)
vote_hard.fit(data1[data1_x_bin], data1[Target])
print("Hard Voting Training w/bin score mean: {:.2f}". format(vote_hard_cv['train_score'].mean()*100))
print("Hard Voting Test w/bin score mean: {:.2f}". format(vote_hard_cv['test_score'].mean()*100))
print("Hard Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_hard_cv['test_score'].std()*100*3))
print('-'*10)
#Soft Vote or weighted probabilities
vote_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
vote_soft_cv = model_selection.cross_validate(vote_soft, data1[data1_x_bin], data1[Target], cv = cv_split)
vote_soft.fit(data1[data1_x_bin], data1[Target])
print("Soft Voting Training w/bin score mean: {:.2f}". format(vote_soft_cv['train_score'].mean()*100))
print("Soft Voting Test w/bin score mean: {:.2f}". format(vote_soft_cv['test_score'].mean()*100))
print("Soft Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_soft_cv['test_score'].std()*100*3))
print('-'*10)
Hard Voting Training w/bin score mean: 86.61
Hard Voting Test w/bin score mean: 82.46
Hard Voting Test w/bin score 3*std: +/- 4.34
----------
Soft Voting Training w/bin score mean: 87.23
Soft Voting Test w/bin score mean: 82.24
Soft Voting Test w/bin score 3*std: +/- 4.94
----------
#WARNING: Running is very computational intensive and time expensive.
#Code is written for experimental/developmental purposes and not production ready!
#Hyperparameter Tune with GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
grid_n_estimator = [10, 50, 100, 300]
grid_ratio = [.1, .25, .5, .75, 1.0]
grid_learn = [.01, .03, .05, .1, .25]
grid_max_depth = [2, 4, 6, 8, 10, None]
grid_min_samples = [5, 10, .03, .05, .10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [0]
grid_param = [
[{
#AdaBoostClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
'n_estimators': grid_n_estimator, #default=50
'learning_rate': grid_learn, #default=1
#'algorithm': ['SAMME', 'SAMME.R'], #default=’SAMME.R
'random_state': grid_seed
}],
[{
#BaggingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
'n_estimators': grid_n_estimator, #default=10
'max_samples': grid_ratio, #default=1.0
'random_state': grid_seed
}],
[{
#ExtraTreesClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
'n_estimators': grid_n_estimator, #default=10
'criterion': grid_criterion, #default=”gini”
'max_depth': grid_max_depth, #default=None
'random_state': grid_seed
}],
[{
#GradientBoostingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
#'loss': ['deviance', 'exponential'], #default=’deviance’
'learning_rate': [.05], #default=0.1 -- 12/31/17 set to reduce runtime -- The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 264.45 seconds.
'n_estimators': [300], #default=100 -- 12/31/17 set to reduce runtime -- The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 264.45 seconds.
#'criterion': ['friedman_mse', 'mse', 'mae'], #default=”friedman_mse”
'max_depth': grid_max_depth, #default=3
'random_state': grid_seed
}],
[{
#RandomForestClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
'n_estimators': grid_n_estimator, #default=10
'criterion': grid_criterion, #default=”gini”
'max_depth': grid_max_depth, #default=None
'oob_score': [True], #default=False -- 12/31/17 set to reduce runtime -- The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 146.35 seconds.
'random_state': grid_seed
}],
[{
#GaussianProcessClassifier
'max_iter_predict': grid_n_estimator, #default: 100
'random_state': grid_seed
}],
[{
#LogisticRegressionCV - http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV
'fit_intercept': grid_bool, #default: True
#'penalty': ['l1','l2'],
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], #default: lbfgs
'random_state': grid_seed
}],
[{
#BernoulliNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB
'alpha': grid_ratio, #default: 1.0
}],
#GaussianNB -
[{}],
[{
#KNeighborsClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
'n_neighbors': [1,2,3,4,5,6,7], #default: 5
'weights': ['uniform', 'distance'], #default = ‘uniform’
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
}],
[{
#SVC - http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
#http://blog.hackerearth.com/simple-tutorial-svm-parameter-tuning-python-r
#'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'C': [1,2,3,4,5], #default=1.0
'gamma': grid_ratio, #default: auto
'decision_function_shape': ['ovo', 'ovr'], #default:ovr
'probability': [True],
'random_state': grid_seed
}],
[{
#XGBClassifier - http://xgboost.readthedocs.io/en/latest/parameter.html
'learning_rate': grid_learn, #default: .3
'max_depth': [1,2,4,6,8,10], #default 2
'n_estimators': grid_n_estimator,
'seed': grid_seed
}]
]
start_total = time.perf_counter() #https://docs.python.org/3/library/time.html#time.perf_counter
for clf, param in zip(vote_est, grid_param): #https://docs.python.org/3/library/functions.html#zip

    #print(clf[1]) #vote_est is a list of tuples, index 0 is the name and index 1 is the algorithm
    #print(param)
    start = time.perf_counter()
    best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'roc_auc')
    best_search.fit(data1[data1_x_bin], data1[Target])
    run = time.perf_counter() - start

    best_param = best_search.best_params_
    print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))
    clf[1].set_params(**best_param)

run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total/60))
print('-'*10)
The best parameter for AdaBoostClassifier is {'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0} with a runtime of 38.14 seconds.
The best parameter for BaggingClassifier is {'max_samples': 0.25, 'n_estimators': 300, 'random_state': 0} with a runtime of 39.18 seconds.
The best parameter for ExtraTreesClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0} with a runtime of 71.71 seconds.
The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 38.23 seconds.
The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 85.22 seconds.
The best parameter for GaussianProcessClassifier is {'max_iter_predict': 10, 'random_state': 0} with a runtime of 9.17 seconds.
The best parameter for LogisticRegressionCV is {'fit_intercept': True, 'random_state': 0, 'solver': 'liblinear'} with a runtime of 9.05 seconds.
The best parameter for BernoulliNB is {'alpha': 0.1} with a runtime of 0.29 seconds.
The best parameter for GaussianNB is {} with a runtime of 0.06 seconds.
The best parameter for KNeighborsClassifier is {'algorithm': 'brute', 'n_neighbors': 7, 'weights': 'uniform'} with a runtime of 6.09 seconds.
The best parameter for SVC is {'C': 2, 'decision_function_shape': 'ovo', 'gamma': 0.1, 'probability': True, 'random_state': 0} with a runtime of 22.73 seconds.
The best parameter for XGBClassifier is {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0} with a runtime of 52.34 seconds.
Total optimization time was 6.20 minutes.
----------
#Hard Vote or majority rules w/Tuned Hyperparameters
grid_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
grid_hard_cv = model_selection.cross_validate(grid_hard, data1[data1_x_bin], data1[Target], cv = cv_split)
grid_hard.fit(data1[data1_x_bin], data1[Target])
print("Hard Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_hard_cv['train_score'].mean()*100))
print("Hard Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_hard_cv['test_score'].mean()*100))
print("Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_hard_cv['test_score'].std()*100*3))
print('-'*10)
#Soft Vote or weighted probabilities w/Tuned Hyperparameters
grid_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
grid_soft_cv = model_selection.cross_validate(grid_soft, data1[data1_x_bin], data1[Target], cv = cv_split)
grid_soft.fit(data1[data1_x_bin], data1[Target])
print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_soft_cv['train_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_soft_cv['test_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_soft_cv['test_score'].std()*100*3))
print('-'*10)
#12/31/17 tuned with data1_x_bin
#The best parameter for AdaBoostClassifier is {'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0} with a runtime of 33.39 seconds.
#The best parameter for BaggingClassifier is {'max_samples': 0.25, 'n_estimators': 300, 'random_state': 0} with a runtime of 30.28 seconds.
#The best parameter for ExtraTreesClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0} with a runtime of 64.76 seconds.
#The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 34.35 seconds.
#The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 76.32 seconds.
#The best parameter for GaussianProcessClassifier is {'max_iter_predict': 10, 'random_state': 0} with a runtime of 6.01 seconds.
#The best parameter for LogisticRegressionCV is {'fit_intercept': True, 'random_state': 0, 'solver': 'liblinear'} with a runtime of 8.04 seconds.
#The best parameter for BernoulliNB is {'alpha': 0.1} with a runtime of 0.19 seconds.
#The best parameter for GaussianNB is {} with a runtime of 0.04 seconds.
#The best parameter for KNeighborsClassifier is {'algorithm': 'brute', 'n_neighbors': 7, 'weights': 'uniform'} with a runtime of 4.84 seconds.
#The best parameter for SVC is {'C': 2, 'decision_function_shape': 'ovo', 'gamma': 0.1, 'probability': True, 'random_state': 0} with a runtime of 29.39 seconds.
#The best parameter for XGBClassifier is {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0} with a runtime of 46.23 seconds.
#Total optimization time was 5.56 minutes.
Hard Voting w/Tuned Hyperparameters Training w/bin score mean: 85.22
Hard Voting w/Tuned Hyperparameters Test w/bin score mean: 82.31
Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 5.26
----------
Soft Voting w/Tuned Hyperparameters Training w/bin score mean: 84.76
Soft Voting w/Tuned Hyperparameters Test w/bin score mean: 82.28
Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 5.42
----------
#prepare data for modeling
print(data_val.info())
print("-"*10)
#data_val.sample(10)
#handmade decision tree - submission score = 0.77990
data_val['Survived'] = mytree(data_val).astype(int)
#decision tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_dt = tree.DecisionTreeClassifier()
#submit_dt = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split)
#submit_dt.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_dt.best_params_) #Best Parameters: {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
#data_val['Survived'] = submit_dt.predict(data_val[data1_x_bin])
#bagging w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77990
#submit_bc = ensemble.BaggingClassifier()
#submit_bc = model_selection.GridSearchCV(ensemble.BaggingClassifier(), param_grid= {'n_estimators':grid_n_estimator, 'max_samples': grid_ratio, 'oob_score': grid_bool, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_bc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_bc.best_params_) #Best Parameters: {'max_samples': 0.25, 'n_estimators': 500, 'oob_score': True, 'random_state': 0}
#data_val['Survived'] = submit_bc.predict(data_val[data1_x_bin])
#extra tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_etc = ensemble.ExtraTreesClassifier()
#submit_etc = model_selection.GridSearchCV(ensemble.ExtraTreesClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_etc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_etc.best_params_) #Best Parameters: {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_etc.predict(data_val[data1_x_bin])
#random forest w/full dataset modeling submission score: defaults= 0.71291, tuned= 0.73205
#submit_rfc = ensemble.RandomForestClassifier()
#submit_rfc = model_selection.GridSearchCV(ensemble.RandomForestClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_rfc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_rfc.best_params_) #Best Parameters: {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_rfc.predict(data_val[data1_x_bin])
#ada boosting w/full dataset modeling submission score: defaults= 0.74162, tuned= 0.75119
#submit_abc = ensemble.AdaBoostClassifier()
#submit_abc = model_selection.GridSearchCV(ensemble.AdaBoostClassifier(), param_grid={'n_estimators': grid_n_estimator, 'learning_rate': grid_ratio, 'algorithm': ['SAMME', 'SAMME.R'], 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_abc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_abc.best_params_) #Best Parameters: {'algorithm': 'SAMME.R', 'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0}
#data_val['Survived'] = submit_abc.predict(data_val[data1_x_bin])
#gradient boosting w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77033
#submit_gbc = ensemble.GradientBoostingClassifier()
#submit_gbc = model_selection.GridSearchCV(ensemble.GradientBoostingClassifier(), param_grid={'learning_rate': grid_ratio, 'n_estimators': grid_n_estimator, 'max_depth': grid_max_depth, 'random_state':grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_gbc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_gbc.best_params_) #Best Parameters: {'learning_rate': 0.25, 'max_depth': 2, 'n_estimators': 50, 'random_state': 0}
#data_val['Survived'] = submit_gbc.predict(data_val[data1_x_bin])
#extreme boosting w/full dataset modeling submission score: defaults= 0.73684, tuned= 0.77990
#submit_xgb = XGBClassifier()
#submit_xgb = model_selection.GridSearchCV(XGBClassifier(), param_grid= {'learning_rate': grid_learn, 'max_depth': [0,2,4,6,8,10], 'n_estimators': grid_n_estimator, 'seed': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_xgb.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_xgb.best_params_) #Best Parameters: {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0}
#data_val['Survived'] = submit_xgb.predict(data_val[data1_x_bin])
#hard voting classifier w/full dataset modeling submission score: defaults= 0.75598, tuned = 0.77990
#data_val['Survived'] = vote_hard.predict(data_val[data1_x_bin])
data_val['Survived'] = grid_hard.predict(data_val[data1_x_bin])
#soft voting classifier w/full dataset modeling submission score: defaults= 0.73684, tuned = 0.74162
#data_val['Survived'] = vote_soft.predict(data_val[data1_x_bin])
#data_val['Survived'] = grid_soft.predict(data_val[data1_x_bin])
#submit file
submit = data_val[['PassengerId','Survived']]
submit.to_csv("submit.csv", index=False)
print('Validation Data Distribution: \n', data_val['Survived'].value_counts(normalize = True))
submit.sample(10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 21 columns):
PassengerId      418 non-null int64
Pclass           418 non-null int64
Name             418 non-null object
Sex              418 non-null object
Age              418 non-null float64
SibSp            418 non-null int64
Parch            418 non-null int64
Ticket           418 non-null object
Fare             418 non-null float64
Cabin            91 non-null object
Embarked         418 non-null object
FamilySize       418 non-null int64
IsAlone          418 non-null int64
Title            418 non-null object
FareBin          418 non-null category
AgeBin           418 non-null category
Sex_Code         418 non-null int64
Embarked_Code    418 non-null int64
Title_Code       418 non-null int64
AgeBin_Code      418 non-null int64
FareBin_Code     418 non-null int64
dtypes: category(2), float64(2), int64(11), object(6)
memory usage: 63.1+ KB
None
----------
Validation Data Distribution: 
 0    0.633971
1    0.366029
Name: Survived, dtype: float64

PassengerId Survived
15 907 1
245 1137 0
230 1122 0
280 1172 1
16 908 0
181 1073 1
311 1203 0
389 1281 0
383 1275 1
119 1011 1

Step 7: Optimize and Strategize

Conclusion

Iteration one of the Data Science Framework seems to converge on a 0.77990 submission accuracy. Using the same dataset and different implementations of a decision tree (AdaBoost, random forest, gradient boosting, XGBoost, etc.) with tuning does not exceed the 0.77990 submission accuracy. Interestingly, for this dataset, the simple decision tree algorithm had the best default submission score, and with tuning it achieved the same best accuracy score.

While no general conclusions can be drawn from testing a handful of algorithms on a single dataset, several observations stand out for this dataset:

  • The train dataset has a different distribution than the test/validation dataset and the underlying population, which created a wide gap between the cross-validation (CV) accuracy score and the Kaggle submission accuracy score (a quick way to check this is sketched after this list).
  • Given the same dataset, decision-tree-based algorithms seemed to converge on the same accuracy score after proper tuning.
  • Despite tuning, no machine learning algorithm exceeded the homemade algorithm. I would theorize that, for small datasets, a hand-made algorithm is the bar to beat (a minimal rule-based sketch also follows below).

With that in mind, for iteration two I would spend more time on preprocessing and feature engineering, in order to better align the CV score with the Kaggle score and improve the overall accuracy.
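
A quick way to eyeball the train-versus-test distribution shift mentioned in the first observation is to put the normalized value counts of a few engineered features side by side. This is only a sketch, assuming data1, data_val, and pandas (as pd) are available from the earlier cells:

#compare normalized value counts of a few features between train (data1) and test (data_val)
for col in ['Sex', 'Pclass', 'Title', 'IsAlone']:
    compare = pd.concat([data1[col].value_counts(normalize = True).rename('train'),
                         data_val[col].value_counts(normalize = True).rename('test')], axis = 1)
    print('-'*10, col, '-'*10)
    print(compare, '\n')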

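To make the "hand-made algorithm is the bar to beat" point concrete, here is a minimal rule-based baseline. Note that this is not the author's handmade model from the modeling section; it is only the classic gender rule, scored on the training labels for illustration:

#minimal hand-coded baseline: predict survival for female passengers only
def gender_rule(df):
    #returns 1 for female passengers, 0 otherwise
    return (df['Sex'] == 'female').astype(int)

baseline_pred = gender_rule(data1)
baseline_acc = (baseline_pred == data1['Survived']).mean()
print('Gender-rule baseline train accuracy: {:.4f}'.format(baseline_acc))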
Translator's note: the "99% accuracy" in the title is really a figure of speech, a way of thinking; once you have understood the article, you may well forget about the title.

Change Log:

  • 11/22/17 Please note, this kernel is currently in progress, but open to feedback. Thanks!
  • 11/23/17 Cleaned up published notebook and updated through step 3.
  • 11/25/17 Added enhancements to published notebook and started step 4.
  • 11/26/17 Skipped ahead to data model, since this is a published notebook. Accuracy with (very) simple data cleaning and logistic regression is ~82%. Continue to up vote and I will continue to develop this notebook. Thanks!
  • 12/2/17 Updated section 4 with exploratory analysis and section 5 with more classifiers. Improved model to ~85% accuracy.
  • 12/3/17 Update section 4 with improved graphical statistics.
  • 12/7/17 Updated section 5 with Data Science 101 Lesson.
  • 12/8/17 Reorganized section 3 & 4 with cleaner code.
  • 12/9/17 Updated section 5 with model optimization how-tos. Initial competition submission with Decision Tree; will update with better algorithm later.
  • 12/10/17 Updated section 3 & 4 with cleaner code and better datasets.
  • 12/11/17 Updated section 5 with better how-tos.
  • 12/12/17 Cleaned section 5 to prep for hyper-parameter tuning.
  • 12/13/17 Updated section 5 to focus on learning data modeling via decision tree.
  • 12/20/17 Updated section 4 - Thanks @Daniel M. for suggestion to split up visualization code. Started working on section 6 for “super” model.
  • 12/23/17 Edited section 1-5 for clarity and more concise code.
  • 12/24/17 Updated section 5 with random_state and score for more consistent results.
  • 12/31/17 Completed data science framework iteration 1 and added section 7 with conclusion.

Credits

Programming is all about “borrowing” code, because knife sharpens knife. Nonetheless, I want to give credit where credit is due.

  • Introduction to Machine Learning with Python: A Guide for Data Scientists by Andreas Müller and Sarah Guido - Machine Learning 101 written by a core developer of sklearn

  • Visualize This: The Flowing Data Guide to Design, Visualization, and Statistics by Nathan Yau - Learn the art and science of data visualization
  • Machine Learning for Dummies by John Mueller and Luca Massaron - Easy for a beginner to understand, yet detailed enough to actually learn the fundamentals of the topic