数据分析实践1

2019年1月2日

整个机器学习的流程主要分如下几个主要过程:

  1. 业务理解
  2. 数据分析理解 
  3. 数据预处理
  4. 建模
  5. 模型效果评估
  6. 使用合格的模型做预测

数据分析理解是机器学习建模的基础,目的是分析挖掘数据维度中和预测目标强相关的因素, 剔除不相关或者弱相关的因素, 用于后续的机器学习;

Python ,spark框架中常用dataframe结构处理数据集,下面使用python+银行营销数据做数据分析实践:

1 理解分析数据

1.1 加载数据并理解数据有哪些维度信息:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:/code/dataML/ibank-additional.csv",sep =";")
print(df.head())

打印结果,可以看出维度信息,哪些是数值型,哪些是字符串型

   age          job  marital ... euribor3m nr.employed   y
0   30  blue-collar  married ...     1.313      5099.1  no
1   39     services   single ...     4.855      5191.0  no
2   25     services  married ...     4.962      5228.1  no
3   38     services  married ...     4.959      5228.1  no
4   47       admin.  married ...     4.191      5195.8  no

1.2 理解数据集中 数值型 数据的分布统计情况:

print(df.describe())

打印结果:

  • 总数据量有4119个
  • 数据中客户大部分年龄均方差在40岁,最小18岁,最大88岁,50%的客户年龄>38岁,75%客户年龄>47岁
               age     duration     ...         euribor3m  nr.employed
count  4119.000000  4119.000000     ...       4119.000000  4119.000000
mean     40.113620   256.788055     ...          3.621356  5166.481695
std      10.313362   254.703736     ...          1.733591    73.667904
min      18.000000     0.000000     ...          0.635000  4963.600000
25%      32.000000   103.000000     ...          1.334000  5099.100000
50%      38.000000   181.000000     ...          4.857000  5191.000000
75%      47.000000   317.000000     ...          4.961000  5228.100000
max      88.000000  3643.000000     ...          5.045000  5228.100000

 

 

1.3 分析各个维度信息和 标签列Y的相关性

age 和 y的相关性分析:

sns.countplot(x = "age", hue ="y",data = df, palette = "Blues")
plt.show( )

年龄和Y有一定相关性…年龄到60岁后买理财产品的比例相对较大,但绝对人数较少,32岁左右买理财产品的比例相对较低

marital和 y的相关性分析:

sns.countplot(x = "marital", hue ="y",data = df, palette = "Blues")
plt.show( )

显示已婚的人群中,买理财产品的比例相对比较低,但购买的总人数较多

学历和y值的关系:

sns.countplot(x = "education", hue ="y",data = df, palette = "Blues")
plt.show( )

专家级别的客户总人数不算最多,但是购买比例较高

是否有房贷和y值的关系:(似乎差异不大,弱因子,可以考虑剔除)

是否有个人贷款和y值的关系:

上次联系通信方式和y的关系:

month(上次联系的月份)与y的关系:

day_of_week(上次联系的星期几)与y 的关系:(差异不大,弱因子,可以考虑剔除)

其他维度数据与y相关性分析类似,不赘述

 

1.4 分析数值型数据维度之间的相关性:

print(df.corr())

打印结果:

                     age  duration     ...       euribor3m  nr.employed
age             1.000000  0.041299     ...       -0.015033    -0.041936
duration        0.041299  1.000000     ...       -0.032329    -0.044218
campaign       -0.014169 -0.085348     ...        0.159435     0.161037
pdays          -0.043425 -0.046998     ...        0.301478     0.381983
previous        0.050931  0.025724     ...       -0.458851    -0.514853
emp.var.rate   -0.019192 -0.028848     ...        0.970308     0.897173
cons.price.idx -0.000482  0.016672     ...        0.657159     0.472560
cons.conf.idx   0.098135 -0.034745     ...        0.276595     0.107054
euribor3m      -0.015033 -0.032329     ...        1.000000     0.942589
nr.employed    -0.041936 -0.044218     ...        0.942589     1.000000

可视化展现:

f,ax = plt.subplots(figsize=(10, 10))
sns.heatmap(df.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
plt.show()

矩阵内容斜对称,数值越大表示越正相关,否则负相关

从上图结果很直观的发现几个相关性较大的维度:

euribor3m(3个月的利率) 和emp.var.rate(就业率)

nr.employed(就业人数)和emp.var.rate

nr.employed和euribor3m

相关性特别强的因子之间, 我们也许只要选择其中的一个即可,可以减少机器学习算法的数据特征数量;

由于标签列Y目前不是数值类型所以,无法直接使用heatmap发现其他列与它的相关性, 可以先把Y和其他列数据先数值化后再生成热力图再观察一下:)

 

 

 

1.5 分析 非数值型数据的分布情况:

print(df.describe(include=['O']))

打印结果

a. 一共有4119条数据

b. Job 没有空值,且它有12种类型,其中最多的类型是admin 有1012个约25%

c. marital 也没有空值,它有4种类型,其中最多的是married类型,有2509个约60%

d. education也没有空值,它有8种类型,其中类型最多的是university.degree,有1264个约31%

           job  marital          education  ...  day_of_week     poutcome     y
count     4119     4119               4119  ...         4119         4119  4119
unique      12        4                  8  ...            5            3     2
top     admin.  married  university.degree  ...          thu  nonexistent    no
freq      1012     2509               1264  ...          860         3523  3668

[4 rows x 11 columns]

 

2 数据预处理

主要包括空值处理, 极端值剔除,数值化,标准化等操作

按照不同类型的数据分成3类: 数值型, 2元分类型, 多元分类型 (方便后续分别做数值化处理)

    numeric_attrs = ['age', 'duration', 'campaign', 'pdays', 'previous',
                     'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
                     'euribor3m', 'nr.employed' ]
    bin_attrs = ['default', 'housing', 'loan']
    cate_attrs = ['poutcome', 'education', 'job', 'marital',
                  'contact', 'month', 'day_of_week']

 

2.1 极端值剔除:

在数据可视化分析时, 已经能发现一些极端值, 比如age中的年龄<20,>80岁的场景, 可以考虑剔除;

 

2.2 数值化&标准化处理

多元分类类型数值化处理函数:

def encode_cate_attrs(data, cate_attrs):
    data = encode_edu_attrs(data)#education是有序分类变量编码,单独处理
    cate_attrs.remove('education')

    #无序分类变量编码
    for i in cate_attrs:
        dummies_df = pd.get_dummies(data[i])#get_dummies 是利用pandas实现one hot encode的方式
        dummies_df = dummies_df.rename(columns=lambda x: i + '_' + str(x))
        data = pd.concat([data, dummies_df], axis=1)
        data = data.drop(i, axis=1)
    return data

#education是有序分类变量编码
def encode_edu_attrs(data):
    values = ["illiterate", "basic.4y", "basic.6y", "basic.9y",
              "high.school", "professional.course", "university.degree"]
    levels = range(1, len(values) + 1)
    dict_levels = dict(zip(values, levels))
    for v in values:
        data.loc[data['education'] == v, 'education'] = dict_levels[v]
    return data

二元分类数值化处理函数:

def encode_bin_attrs(data, bin_attrs):
    for i in bin_attrs:
        data.loc[data[i] == 'no', i] = 0
        data.loc[data[i] == 'yes', i] = 1
    return data

数值型数据向量化&标准化处理函数:

pandas.qcut()函数来离散化连续数据(Spark中提供bucketize实现),它使用分位数对数据进行划分(分箱: bining),可以得到大小基本相等的箱子(bin),以区间形式表示。然后使用pandas.factorize()函数将区间转为数值

比如,age 我们并不需要所有的离散值分析, 只需要分几个年龄阶段就可以了

def trans_num_attrs(data, numeric_attrs):
    bining_num = 10
    bining_attr = 'age'
    data[bining_attr] = pd.qcut(data[bining_attr], bining_num)
    data[bining_attr] = pd.factorize(data[bining_attr])[0] + 1

    for i in numeric_attrs:
        scaler = preprocessing.StandardScaler()
        data[i] = scaler.fit_transform(data[i].values.reshape(-1,1))
    return data

先把所有分类数据数值化,再把所有数值化后的数据通过trans_num_attrs 向量化和标准化处理:

    data = encode_cate_attrs(data, cate_attrs)
    data = encode_bin_attrs(data, bin_attrs)
    data = trans_num_attrs(data, numeric_attrs)
    data['y'] = data['y'].map({'no': 0, 'yes': 1}).astype(int)

2.3 空数据处理:

空值处理一般有3种方式:

a.  赋值默认值

b. 删除空值多的维度

c.  使用已有值预测出空值

data['age'] = data.age.fillna(data.age.mean())

 

一般一个维度的数据缺失率>=15% 就会放弃使用这个维度的数据,以免数据失真

数值类型的数据判断:

total = df.isnull().sum()
percent = (df.isnull().sum()/df.isnull().count())
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data)

打印结果如下,显示各维度都无空数据

                Total  Percent
age                 0      0.0
campaign            0      0.0
cons.conf.idx       0      0.0
cons.price.idx      0      0.0
contact             0      0.0
day_of_week         0      0.0
default             0      0.0
duration            0      0.0
education           0      0.0
emp.var.rate        0      0.0
euribor3m           0      0.0
housing             0      0.0
job                 0      0.0
loan                0      0.0
marital             0      0.0
month               0      0.0
nr.employed         0      0.0
pdays               0      0.0
poutcome            0      0.0
previous            0      0.0
y                   0      0.0

删除缺失数据较多的列

#dealing with missing data
df= df.drop((missing_data[missing_data['Percent'] > 0.15]).index,1)
df.isnull().sum().max() #最后确认没有missing数据

非数值类维度是否有空值,比如”unknown”:

for i in df.columns:
    if type(df[i][0]) is str:
        print("unknown value count in "+i+":\t"+str(df[df[i]=="unknown"]['y'].count()))

打印结果,发现defalut维度缺失的量比较大,  803/4119=0.195 , 已经>15%可以考虑剔除该维度数据:

unknown value count in job:	39
unknown value count in marital:	11
unknown value count in education:	167
unknown value count in default:	803
unknown value count in housing:	105
unknown value count in loan:	105
unknown value count in contact:	0
unknown value count in month:	0
unknown value count in day_of_week:	0
unknown value count in poutcome:	0
unknown value count in y:	0

如果缺失数据较多的维度比较重要删除也不太适合, 也有人使用数据完整的行作为训练集,以此来预测缺失值,变量housing,loan,education和default的缺失值可以采取此法,

由于模型只能处理数值变量,需要先将分类变量数值化,然后进行预测,比如使用随机森林预测缺失值:

def fill_unknown(data, bin_attrs, cate_attrs, numeric_attrs):
    fill_attrs = []
    for i in bin_attrs + cate_attrs:
        if data[data[i] == 'unknown']['y'].count() < 500:
            # delete col containing unknown
            data = data[data[i] != 'unknown']
        else:
            fill_attrs.append(i)

    data = encode_cate_attrs(data, cate_attrs)
    data = encode_bin_attrs(data, bin_attrs)
    data = trans_num_attrs(data, numeric_attrs)
    data['y'] = data['y'].map({'no': 0, 'yes': 1}).astype(int)
    for i in fill_attrs:
        test_data = data[data[i] == 'unknown']
        testX = test_data.drop(fill_attrs, axis=1)
        train_data = data[data[i] != 'unknown']
        trainY = train_data[i]
        trainX = train_data.drop(fill_attrs, axis=1)
        test_data[i] = train_predict_unknown(trainX, trainY.astype('int'), testX)
        data = pd.concat([train_data, test_data])
    return data

def train_predict_unknown(trainX, trainY, testX):
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainX, trainY)
    test_predictY = forest.predict(testX).astype(int)
    return pd.DataFrame(test_predictY, index=testX.index)

data = fill_unknown(data, bin_attrs, cate_attrs, numeric_attrs)

预处理后的部分数据样例:

age,education,default,housing,loan,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,poutcome_failure,poutcome_nonexistent,poutcome_success,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,marital_divorced,marital_married,marital_single,contact_cellular,contact_telephone,month_apr,month_aug,month_dec,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed
29528,-1.5407029621226411,5,0,1,0,-0.9616741473849848,0.8792728510796384,0.19471978718933916,-0.349340630297601,-1.1976943722378013,-0.8638900037329893,-1.4230792758363522,-1.2768324248035627,-0.9402874889310989,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0
11077,-0.8373901096042032,5,0,0,0,-0.3946898019668333,0.15674760650699962,0.19471978718933916,-0.349340630297601,0.8403751233526144,1.5386637396382556,-0.27756645910974154,0.7740270002200575,0.8456503392074235,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
4680,-0.4857336833449843,4,0,0,0,-0.5682564383193286,0.8792728510796384,0.19471978718933916,-0.349340630297601,0.6493061081410131,0.7245609963807991,0.8895598069890682,0.7140637221198453,0.33202015762649456,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1

 

之后可以保存数据用于机器学习了,此处不再赘述

没有评论

发表评论

电子邮件地址不会被公开。 必填项已用*标注