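1. Business analysis:
Predict the survival probability of Titanic passengers from their features.
Framework choice: data analysis with pandas,
machine learning with sklearn.
2. Data analysis:
Load the data and inspect its columns and types:
import pandas as pd
df = pd.read_csv('D:/code/sparkProject/sparkInput/titanic-data.csv')
print(df.head())
The output shows the columns and their types: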
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
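Type and null-value analysis:
df.info()
The result shows that the Age, Cabin, and Embarked columns contain null values.
More than 50% of Cabin is null (only 204 of 891 values are present), so the column should be dropped: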
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
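The null rate can also be computed directly instead of being read off the `info()` output. A minimal sketch on a made-up toy frame (the values and the 50% cutoff variable are assumptions for illustration, not the real Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Fraction of missing values per column, largest first.
missing_ratio = toy.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# Columns whose missing ratio exceeds the assumed 50% cutoff.
to_drop = missing_ratio[missing_ratio > 0.5].index.tolist()
print(to_drop)
```

On the real data the same two lines would be applied to `df` instead of `toy`.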
Distribution of the numeric columns:
print(df.describe())
The results suggest that Age, SibSp, Parch, and Fare are right-skewed, which is unfavorable for model training:
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
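Right skew can be checked numerically rather than judged by eye from `describe()`. The sketch below uses a small made-up sample; a positive skewness and a mean well above the median both point to a long right tail:

```python
import pandas as pd

# Made-up sample with one very large fare, mimicking a long right tail.
toy = pd.DataFrame({
    "Fare": [7.25, 7.9, 8.05, 14.45, 31.0, 71.28, 512.33],
    "Pclass": [3, 3, 3, 2, 2, 1, 1],
})

# Sample skewness per column; values above 0 indicate right skew.
skew = toy.skew()

# Gap between mean and median; a large positive gap is another hint.
gap = toy.mean() - toy.median()

print(skew)
print(gap)
```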
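Distribution of the non-numeric columns:
print(df.describe(include=['O']))
The results show that Name, Ticket, and Embarked are discrete categorical variables and Sex is a binary categorical variable.
The distribution of Embarked is also rather uneven (644 of 889 values are S).
Name and Ticket have too many distinct values (891 and 681) and no natural order, so they are unlikely to relate to the label directly and are candidates for removal.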
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Berriman, Mr. William John male 1601 C23 C25 C27 S
freq 1 577 7 4 644
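Correlation between each feature and the label:
Features with only a few distinct values can be analyzed directly with a grouped table, for example Pclass, Sex, SibSp, and Parch:
print(df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False))
The results for each feature are as follows: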
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363
Sex Survived
0 female 0.742038
1 male 0.188908
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000
Parch Survived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000
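The four tables above come from repeating the same groupby for each column. A loop makes that explicit, and reporting the group size next to the rate helps avoid over-reading tiny groups. A sketch on made-up data (the `rates` dictionary is a hypothetical helper name):

```python
import pandas as pd

# Made-up rows; only the column names match the Titanic data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 1, 1, 0, 0, 1],
})

rates = {}
for col in ["Pclass", "Sex"]:
    # Survival rate and group size for every value of the column.
    rates[col] = (
        toy.groupby(col)["Survived"]
        .agg(rate="mean", n="size")
        .sort_values("rate", ascending=False)
    )
    print(rates[col], end="\n\n")
```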
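For features with many distinct values, use binning plus visualization, for example Age:
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.FacetGrid(df, col='Survived')
g.map(plt.hist, 'Age', bins=20)
plt.show()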
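The binning half of this idea can also be expressed as a table. The sketch below cuts Age into fixed-width bands with `pd.cut` and computes the survival rate per band; the band edges and the data are assumptions chosen for illustration:

```python
import pandas as pd

# Made-up ages and outcomes.
toy = pd.DataFrame({
    "Age":      [4, 10, 22, 30, 35, 47, 58, 71],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Five 16-year-wide bands (assumed edges); intervals are right-closed.
toy["AgeBand"] = pd.cut(toy["Age"], bins=[0, 16, 32, 48, 64, 80])

# Survival rate per band, in band order.
age_rates = toy.groupby("AgeBand", observed=False)["Survived"].mean()
print(age_rates)
```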
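Drop feature columns unrelated to the label
From the analysis above, the columns that can be dropped are Cabin, Ticket, and PassengerId:
df = df.drop(['Ticket', 'Cabin', 'PassengerId'], axis=1)
Name also seems unrelated to the label. Can it be dropped as well?
The names in the data include a title (Mr, Mrs, Miss, ...), which may carry information related to survival, so extract it into a new Title column first:
import re
pattern1 = re.compile(r" ([A-Za-z]+)\.")
for index in df.index:
    # the title is the first word that is followed by a period
    df.loc[index, 'Title'] = pattern1.findall(df.loc[index, 'Name'])[0]
print(pd.crosstab(df['Title'], df['Sex']))
Printed result: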
Sex female male
Title
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
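The row-by-row loop can be replaced with a vectorized call. A sketch using `Series.str.extract` with the same regular expression, followed by `pd.crosstab` to produce a title-by-sex count table like the one above (the three names are sample rows from the head of the data):

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
    "Sex": ["male", "female", "female"],
})

# Capture the first word that is preceded by a space and followed by a period.
toy["Title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Count of each title per sex.
counts = pd.crosstab(toy["Title"], toy["Sex"])
print(counts)
```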
To simplify the categories, group the low-count titles 'Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', and 'Dona' into a single Title category, Rare, and merge Mlle and Ms into Miss and Mme into Mrs:
rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
title_aliases = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'}
for index in df.index:
    title = df.loc[index, 'Title']
    if title in rare_titles:
        title = 'Rare'
    # exact-match lookup, so no other title is altered by accident
    df.loc[index, 'Title'] = title_aliases.get(title, title)
print(df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())
The new Title feature correlates fairly well with the label:
Title Survived
0 Master 0.575000
1 Miss 0.702703
2 Mr 0.156673
3 Mrs 0.793651
4 Rare 0.347826
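The same grouping can be done without a loop. A minimal sketch, assuming the titles are already in a Series: `Series.replace` with a dict matches whole values only, and `Series.mask` overwrites the entries flagged by `isin`:

```python
import pandas as pd

titles = pd.Series(["Mr", "Mlle", "Ms", "Mme", "Dr", "Master", "Miss"])

rare_titles = ["Lady", "Countess", "Capt", "Col", "Don", "Dr",
               "Major", "Rev", "Sir", "Jonkheer", "Dona"]

# Merge spelling variants into their common form (whole-value match).
merged = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

# Collapse every low-count title into a single 'Rare' category.
merged = merged.mask(merged.isin(rare_titles), "Rare")
print(merged.tolist())
```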
Replace the Name feature with the Title column:
df = df.drop(['Name'], axis=1)
Convert non-numeric features to numbers
Encode the Title feature:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for index in df.index:
    df.loc[index, 'Title'] = title_mapping.get(df.loc[index, 'Title'])
print(df.head(3))
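`Series.map` performs the same dictionary lookup in one call. Values missing from the dictionary become NaN, which makes it easy to spot titles that the mapping forgot. A sketch ('Sir' is included on purpose as an unmapped value):

```python
import pandas as pd

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
titles = pd.Series(["Mr", "Mrs", "Rare", "Sir"])

# Dictionary lookup for every element; unknown titles become NaN.
codes = titles.map(title_mapping)

# Titles that were not covered by the mapping.
unmapped = titles[codes.isna()].unique().tolist()
print(codes.tolist())
print(unmapped)
```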
SibSp is the number of siblings/spouses aboard and Parch is the number of parents/children aboard. Consider combining them into a single family size feature:
for index in df.index:
    df.loc[index, 'FamilySize'] = df.loc[index, 'SibSp'] + df.loc[index, 'Parch'] + 1
print(df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False))
Relationship between FamilySize and the label:
FamilySize Survived
4 4.0 0.724138
3 3.0 0.578431
2 2.0 0.552795
7 7.0 0.333333
1 1.0 0.303538
5 5.0 0.200000
6 6.0 0.136364
0 0.0 0.000000
8 8.0 0.000000
9 11.0 0.000000
Because FamilySize still takes many different values, it is better to reduce it further to a binary variable, IsAlone, indicating whether the passenger traveled alone:
df['IsAlone'] = 0  # create the column once, before the loop
for index in df.index:
    if df.loc[index, 'FamilySize'] == 1:
        df.loc[index, 'IsAlone'] = 1
print(df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean())
The IsAlone feature shows a clear difference in survival rate, so it is kept:
IsAlone Survived
0 0.0 0.504225
1 1.0 0.303538
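Both derived columns can be computed with column arithmetic instead of loops. A sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size counts the passenger plus relatives aboard.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# 1 when the passenger travels alone, 0 otherwise.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
print(toy)
```

Column arithmetic also keeps both columns as integers, whereas assigning cell by cell into a new column tends to produce floats.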
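Drop the feature columns that have been replaced:
df = df.drop(['SibSp', 'Parch', 'FamilySize'], axis=1)
print(df.head(2))
The new feature columns are as follows (Sex and Age already appear in encoded form here; the steps that encode them are not shown):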
Survived Pclass Sex Age Fare Embarked Title IsAlone
0 0 3 0 1.0 7.2500 S 1 0.0
1 1 1 1 2.0 71.2833 C 3 0.0
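Encode the Embarked feature:
embarked_mapping = {'S': 0, 'C': 1, 'Q': 2}
for index in df.index:
    # rows with a missing Embarked value get None here and are removed by the later dropna()
    df.loc[index, 'Embarked'] = embarked_mapping.get(df.loc[index, 'Embarked'])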
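Embarked has two missing values, and the lookup above leaves them empty. One alternative to dropping those rows is to fill them with the most common port before mapping. This is an assumption on my part, not what the walkthrough does, and is sketched on a made-up Series:

```python
import pandas as pd

embarked_mapping = {"S": 0, "C": 1, "Q": 2}
emb = pd.Series(["S", "C", None, "Q", "S"])

# Most frequent port in the sample; used to fill the gaps.
most_common = emb.mode()[0]

# Fill first, then map, so every row ends up with an integer code.
codes = emb.fillna(most_common).map(embarked_mapping).astype(int)
print(most_common, codes.tolist())
```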
Discretize the Fare feature:
Fare is a continuous feature and needs to be binned. The earlier analysis shows a mean of about 32 and a 75th percentile of 31:
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
Consider splitting Fare into 4 bands:
df['FareBand'] = pd.qcut(df['Fare'], 4)
print(df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True))
Survival rate in each band:
FareBand Survived
0 (-0.001, 7.896] 0.196429
1 (7.896, 14.454] 0.303571
2 (14.454, 31.0] 0.454955
3 (31.0, 512.329] 0.581081
Encode Fare as the band index and drop the temporary FareBand column:
for index in df.index:
    a = df.loc[index, 'Fare']
    if a <= 7.91:
        df.loc[index, 'Fare'] = 0
    elif a <= 14.454:
        df.loc[index, 'Fare'] = 1
    elif a <= 31:
        df.loc[index, 'Fare'] = 2
    elif a > 31:
        df.loc[index, 'Fare'] = 3
df = df.drop(['FareBand'], axis=1)
# remove the remaining rows that still contain missing values
df = df.dropna()
print(df.head(2))
New feature column structure:
Survived Pclass Sex Age Fare Embarked Title IsAlone
0 0 3 0 1.0 0.0 0 1 0.0
1 1 1 1 2.0 3.0 1 3 0.0
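The hand-written thresholds repeat the quartile edges that `pd.qcut` already computed. Passing `labels=False` makes `qcut` return the band index directly, and `retbins=True` returns the edges so they can be reused on new data. A sketch on made-up fares:

```python
import pandas as pd

# Made-up fares, already sorted so the band index is easy to follow.
fare = pd.Series([0.0, 7.25, 7.9, 8.05, 14.45, 26.0, 31.0, 53.1, 71.28, 512.33])

# codes: quartile index 0..3 per row; edges: the 5 bin boundaries.
codes, edges = pd.qcut(fare, 4, labels=False, retbins=True)
print(codes.tolist())
print(edges)
```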
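Model selection and training
Use a random forest model, with a train/test split of about 3:1 (the first 600 rows are used for training and the remaining rows for testing):
from sklearn.ensemble import RandomForestClassifier
X_train = df[:600].drop("Survived", axis=1)
X_test = df[600:].drop("Survived", axis=1)
Y_train = df[:600].Survived
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)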
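Slicing by position depends on the row order of the file. As an alternative, `train_test_split` from scikit-learn shuffles the rows and can keep the class balance the same in both parts. A sketch on synthetic data; the 0.25 test fraction and the fixed seed are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature table.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=200),
    "Sex": rng.integers(0, 2, size=200),
    "Survived": rng.integers(0, 2, size=200),
})

X = toy.drop("Survived", axis=1)
y = toy["Survived"]

# 3:1 split, shuffled, with the label distribution preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(len(X_train), len(X_test))
```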
Prediction and evaluation
Y_pred = random_forest.predict(X_test)
# score() is called on the training data here, so this is training accuracy
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(acc_random_forest)
Accuracy on the training set: 87.17
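Training accuracy tends to overstate how well a random forest generalizes, so it is worth scoring the held-out rows as well. The sketch below uses synthetic features and labels (not the Titanic data, so the numbers it prints say nothing about the result above) and compares training accuracy, test accuracy, and a 5-fold cross-validation average:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: the label depends on two features plus noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(300, 5))
y = ((X[:, 0] + X[:, 1] + rng.integers(0, 3, size=300)) > 4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on rows the model has seen versus rows it has not.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# Average accuracy over 5 folds, each fold held out once.
cv_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
).mean()

print(round(train_acc * 100, 2), round(test_acc * 100, 2), round(cv_acc * 100, 2))
```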