Machine Learning in Practice: Predicting Titanic Passenger Survival

April 5, 2018

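1. Business analysis:

Predict the probability that a Titanic passenger survived, based on the passenger's features.

Framework choice: pandas for data analysis, sklearn for machine learning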

 

2. Data analysis:

Load the data and inspect its dimensions and types:

import pandas as pd

df = pd.read_csv('D:/code/sparkProject/sparkInput/titanic-data.csv')
print(df.head())

The output shows the columns and their types:

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

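Types and missing values:

print(df.info())

The output shows that the Age, Cabin, and Embarked columns contain missing values.

Cabin is missing in more than 50% of rows (only 204 of 891 values are present), so the column should be dropped: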

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
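
The missing-value rate can also be computed directly instead of being read off the info() output. The following is a minimal sketch on a made-up toy DataFrame; the helper name missing_ratio and the 50% threshold are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def missing_ratio(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

ratios = missing_ratio(toy)
print(ratios)

# Columns whose missing fraction exceeds a hypothetical 50% threshold.
mostly_missing = ratios[ratios > 0.5].index.tolist()
print(mostly_missing)
```

Applied to the real df, the same call would rank Cabin first, since only 204 of its 891 values are non-null.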

 

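Distribution of numeric features

print(df.describe())

The output suggests that Age, SibSp, Parch, and Fare are right-skewed, which can hurt model training. For example, the mean of Fare (32.20) is more than twice its median (14.45):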

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200 
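
Skewness can be measured rather than judged by eye from describe(). The sketch below uses pandas' built-in skew() on a small made-up sample; the numbers are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Made-up sample with one large outlier, imitating the shape of Fare.
toy = pd.DataFrame({
    "Fare": [7.25, 7.90, 8.05, 14.45, 31.00, 512.33],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
})

# A positive value means a long right tail, a negative value a long left tail.
skew = toy.skew()
print(skew)

# A mean well above the median is another sign of right skew.
print(toy["Fare"].mean(), toy["Fare"].median())
```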

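Distribution of non-numeric features:

print(df.describe(include=['O']))

The output shows that Name, Ticket, and Embarked are discrete categorical variables, and Sex is a binary categorical variable.

Embarked is also unevenly distributed: 644 of its 889 values are S.

Name and Ticket have too many distinct values (891 and 681) and are not ordered categories, so they are unlikely to relate to the label directly; consider dropping them.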

                              Name   Sex Ticket        Cabin Embarked
count                          891   891    891          204      889
unique                         891     2    681          147        3
top     Berriman, Mr. William John  male   1601  C23 C25 C27        S
freq                             1   577      7            4      644
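
One way to quantify "too many distinct values" is the ratio of unique values to non-null values per column. This is a small sketch on made-up data, not the author's method; a ratio near 1 means almost every row has its own value.

```python
import pandas as pd

# Made-up categorical columns for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "male", "male"],
    "Ticket": ["1", "1", "2", "3"],
})

# nunique() and count() both ignore missing values.
cardinality = toy.nunique() / toy.count()
print(cardinality.sort_values(ascending=False))
```

On the real data this ratio would be 891/891 for Name and 681/891 for Ticket, according to the describe() output above.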

 

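Relationship between each feature and the label:

Features with only a few distinct values, such as Pclass, Sex, SibSp, and Parch, can be analyzed directly with a grouped table:

print(df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False))

The results for each of these features are as follows: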

   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363

 

      Sex  Survived
0  female  0.742038
1    male  0.188908

   SibSp  Survived
1      1  0.535885
2      2  0.464286
0      0  0.345395
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000

   Parch  Survived
3      3  0.600000
1      1  0.550847
2      2  0.500000
0      0  0.343658
5      5  0.200000
4      4  0.000000
6      6  0.000000
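
The four tables above come from the same groupby pattern applied to different columns. A minimal sketch of doing that in one loop follows; the helper name survival_rate_by and the toy data are assumptions for illustration.

```python
import pandas as pd


def survival_rate_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Mean of Survived per distinct value of `column`, highest rate first."""
    return (frame[[column, "Survived"]]
            .groupby(column, as_index=False)
            .mean()
            .sort_values(by="Survived", ascending=False))


# Made-up rows for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 1, 1, 0, 0, 1],
})

tables = {col: survival_rate_by(toy, col) for col in ["Pclass", "Sex"]}
for col, table in tables.items():
    print(table)
```

With the real df, the column list would be ['Pclass', 'Sex', 'SibSp', 'Parch'].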

 

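For features with many distinct values, use binning plus visualization, for example Age:

import matplotlib.pyplot as plt
import seaborn as sns

# One histogram of Age per value of Survived
g = sns.FacetGrid(df, col='Survived')
g.map(plt.hist, 'Age', bins=20)
plt.show()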
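
The same binned view can be produced as a table, which is easier to compare than two histograms. The sketch below is one possible approach using pd.cut; the bin edges (16-year bands) and the toy values are assumptions, not taken from the original post.

```python
import pandas as pd

# Made-up ages and outcomes for illustration.
toy = pd.DataFrame({
    "Age": [2, 10, 20, 25, 30, 40, 50, 60, 70, 80],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Hypothetical bands: (0, 16], (16, 32], (32, 48], (48, 64], (64, 80].
toy["AgeBand"] = pd.cut(toy["Age"], bins=[0, 16, 32, 48, 64, 80])

# Survival rate per age band.
rate = toy.groupby("AgeBand", observed=True)["Survived"].mean()
print(rate)
```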

Drop feature columns unrelated to the label

Based on the analysis above, the columns that can be dropped are Cabin, Ticket, and PassengerId:

df = df.drop(['Ticket', 'Cabin','PassengerId'], axis=1)

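Name does not seem to be related to the label either; can it be dropped?

The Name values contain a title such as Mr, Mrs, or Miss, and the title may carry information related to survival:

# Extract the first word that is followed by a period, e.g. "Mr" from "Braund, Mr. Owen Harris"
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print(pd.crosstab(df['Title'], df['Sex']))

Printed result: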

Sex       female  male
Title                 
Capt           0     1
Col            0     2
Countess       1     0
Don            0     1
Dr             1     6
Jonkheer       0     1
Lady           1     0
Major          0     2
Master         0    40
Miss         182     0
Mlle           2     0
Mme            1     0
Mr             0   517
Mrs          125     0
Ms             1     0
Rev            0     6
Sir            0     1

To simplify the categories, group the infrequent titles 'Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', and 'Dona' into a single Title category, Rare, and merge the variants Mlle, Ms, and Mme into Miss and Mrs:

rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
# Collapse the infrequent titles into one category
df['Title'] = df['Title'].replace(rare_titles, 'Rare')
# Merge variant spellings into the common titles
df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

print(df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())

The relationship between the new Title feature and the label is shown below; survival rates differ clearly across titles:

    Title  Survived
0  Master  0.575000
1    Miss  0.702703
2      Mr  0.156673
3     Mrs  0.793651
4    Rare  0.347826

Replace the Name feature with the Title column:

df = df.drop(['Name'], axis=1)

 

Convert non-numeric features to numbers

Convert the Title feature to numbers:

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
df['Title'] = df['Title'].map(title_mapping)

print(df.head(3))

 

SibSp is the number of siblings and spouses aboard, and Parch is the number of parents and children aboard. Consider combining them into a single family-size feature, FamilySize:

# The +1 counts the passenger themselves
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

print(df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False))

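Relationship between FamilySize and the label (FamilySize is at least 1 by construction, so the 0.0 row does not correspond to real passengers and most likely comes from a spurious all-zero row in the DataFrame):

   FamilySize  Survived
4         4.0  0.724138
3         3.0  0.578431
2         2.0  0.552795
7         7.0  0.333333
1         1.0  0.303538
5         5.0  0.200000
6         6.0  0.136364
0         0.0  0.000000
8         8.0  0.000000
9        11.0  0.000000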

Because FamilySize still takes many distinct values, it is better to reduce it further to a binary feature, IsAlone, that marks whether the passenger travelled alone:

# 1 if the passenger has no family aboard, otherwise 0
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

print(df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean())

The relationship between IsAlone and the label is shown below; the two groups differ noticeably, so the feature is kept:

   IsAlone  Survived
0      0.0  0.504225
1      1.0  0.303538

Drop the feature columns that have been replaced:

df = df.drop(['SibSp','Parch','FamilySize'], axis=1)

print(df.head(2))

The new feature set is as follows:

   Survived  Pclass  Sex  Age     Fare Embarked  Title  IsAlone
0         0       3    0  1.0   7.2500        S      1      0.0
1         1       1    1  2.0  71.2833        C      3      0.0
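
The table above shows Sex and Age already encoded as numbers, although the steps that produced them are not listed. The following sketch is one encoding that is consistent with the two rows shown (male to 0, female to 1; age 22 to band 1, age 38 to band 2). The 16-year band edges are an assumption and may differ from what was actually used.

```python
import pandas as pd

# The first two passengers from the head() output earlier in the post.
toy = pd.DataFrame({"Sex": ["male", "female"], "Age": [22.0, 38.0]})

# Assumed mapping, inferred from the encoded table: male -> 0, female -> 1.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# Assumed 16-year age bands: (0, 16] -> 0, (16, 32] -> 1, (32, 48] -> 2, ...
age_bins = [0, 16, 32, 48, 64, float("inf")]
toy["Age"] = pd.cut(toy["Age"], bins=age_bins, labels=False)

print(toy)
```

Rows with a missing Age would stay missing after pd.cut, so they would still need to be filled or dropped afterwards.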

 

Convert the Embarked feature to numbers:

embarked_mapping = {'S': 0, 'C': 1, 'Q': 2}
# Missing Embarked values stay missing after the mapping
df['Embarked'] = df['Embarked'].map(embarked_mapping)
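
Embarked has two missing values (889 of 891 are non-null), and rows with missing values are removed by a later dropna() call. An alternative, shown here only as a sketch on made-up data, is to fill the gaps with the most frequent port before mapping so that no rows are lost.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing port.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q"]})

# mode() ignores missing values; take the most frequent port.
most_common = toy["Embarked"].mode()[0]

embarked_mapping = {"S": 0, "C": 1, "Q": 2}
toy["Embarked"] = toy["Embarked"].fillna(most_common).map(embarked_mapping)

print(toy)
```

Whether filling is better than dropping depends on the data; with only two affected rows the difference is likely to be small.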

 

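Discretize the Fare feature:

Fare is a continuous feature and needs to be binned. The earlier analysis shows that the mean of Fare is about 32 and that its 75th percentile is 31: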

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200 

Consider splitting Fare into 4 bands of roughly equal size:

df['FareBand'] = pd.qcut(df['Fare'], 4)
print(df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True))

Relationship between each fare band and the label:

          FareBand  Survived
0  (-0.001, 7.896]  0.196429
1  (7.896, 14.454]  0.303571
2   (14.454, 31.0]  0.454955
3  (31.0, 512.329]  0.581081
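
pd.qcut chooses the band edges so that each band holds about the same number of rows. It can also return the band index directly through labels=False, which avoids copying the edges by hand. A small sketch on made-up fares:

```python
import pandas as pd

# Twelve made-up fares, so each quartile band should hold three of them.
fares = pd.Series([0, 5, 7, 8, 10, 14, 20, 31, 50, 80, 200, 512], dtype=float)

# Interval labels, as in the FareBand column.
bands = pd.qcut(fares, 4)
counts = bands.value_counts().tolist()
print(bands.value_counts().sort_index())

# Integer codes 0..3 instead of interval labels.
codes = pd.qcut(fares, 4, labels=False)
print(codes.tolist())
```

The codes computed this way depend on the quantiles of whatever data is passed in, so fixed edges are still needed when the same bands must be applied to new data.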

Discretize Fare using the quartile boundaries, then drop the temporary FareBand column:

# Bands: <= 7.91 -> 0, (7.91, 14.454] -> 1, (14.454, 31] -> 2, > 31 -> 3
fare_bins = [float('-inf'), 7.91, 14.454, 31, float('inf')]
df['Fare'] = pd.cut(df['Fare'], bins=fare_bins, labels=False)

df = df.drop(['FareBand'], axis=1)
df = df.dropna()
print(df.head(2))

New feature structure:

   Survived  Pclass  Sex  Age  Fare Embarked  Title  IsAlone
0         0       3    0  1.0   0.0        0      1      0.0
1         1       1    1  2.0   3.0        1      3      0.0

 

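Select and train a model

Use a random forest model. The split is nominally 3:1; in practice the first 600 rows are used for training and the remaining rows for testing.

from sklearn.ensemble import RandomForestClassifier

X_train = df[:600].drop("Survived", axis=1)
X_test = df[600:].drop("Survived", axis=1)
Y_train = df[:600].Survived


random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)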
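
Slicing by row position depends on the order of the file and does not guarantee a 3:1 ratio once rows have been dropped. A common alternative is scikit-learn's train_test_split with test_size=0.25. The sketch below uses made-up data; random_state and stratify are choices made here for reproducibility and balanced classes, not settings from the original post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up feature table with a balanced label.
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3, 1, 2, 3, 1],
    "IsAlone": [0, 1, 1, 0, 0, 1, 1, 0],
    "Survived": [1, 0, 0, 1, 1, 0, 0, 1],
})

X = toy.drop("Survived", axis=1)
y = toy["Survived"]

# 75% train / 25% test, shuffled, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(len(X_train), len(X_test))
```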

 

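Model prediction and evaluation

Y_pred = random_forest.predict(X_test)
# score() on the training data gives training accuracy, not accuracy on unseen data
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(acc_random_forest)

Training-set accuracy: 87.17 (measured on the same rows the model was fit on, so it is likely to overstate accuracy on unseen passengers)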
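
To estimate how the model does on passengers it has not seen, accuracy can be measured on held-out rows or with cross-validation. The following is a minimal sketch on synthetic data, so the numbers it prints say nothing about the Titanic result; the data generator, random_state values, and cv=5 are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: the label follows Sex, with about 10% of rows flipped as noise.
rng = np.random.default_rng(0)
n = 200
sex = rng.integers(0, 2, n)
flip = rng.random(n) < 0.1
toy = pd.DataFrame({
    "Sex": sex,
    "Pclass": rng.integers(1, 4, n),
    "Survived": np.where(flip, 1 - sex, sex),
})

X = toy.drop("Survived", axis=1)
y = toy["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on the rows used for fitting versus rows held out from fitting.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(round(train_acc * 100, 2), round(test_acc * 100, 2))

# 5-fold cross-validation gives a less split-dependent estimate.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)
print(cv_scores.mean())
```

In the post's own code, the equivalent held-out figure would come from random_forest.score(X_test, df[600:].Survived).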

 

 
