flink实践–dataset-ML

2018年4月16日

Flink也支持ML库,但不太成熟:

  1. flink 比spark 支持的ML算法少很多
  2. flink中只有dataset 类型的数据才能使用ML,datastream类型数据没有专门的ML库;
  3. flink 中dateset 不能转换成dataframe结构…特征数据处理感觉不是很方便
  4. flink dateset  ML库中的算法类似乎没有提供模型评估方法

 

一个简单的线性回归算法实践:

1.数据

数据还是著名的鸢尾花卉数据集,包含了5个属性:

& Sepal.Length(花萼长度),单位是cm;
& Sepal.Width(花萼宽度),单位是cm;
& Petal.Length(花瓣长度),单位是cm;
& Petal.Width(花瓣宽度),单位是cm;
& 种类:Iris Setosa(山鸢尾)、Iris Versicolour(杂色鸢尾),以及Iris Virginica(维吉尼亚鸢尾)
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
5.0,3.6,1.4,0.2,0

 

2. 算法代码:
使用线性回归算法预测
import org.apache.flink.api.scala._
import org.apache.flink.ml._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.preprocessing.Splitter
import org.apache.flink.ml.regression.MultipleLinearRegression

object Job {
  def main(args: Array[String]) {
    // set up the execution environment
    val env = ExecutionEnvironment.getExecutionEnvironment

    val iriscsv = env.readCsvFile[(String, String, String, String, String)]("D:/code/dataML/iris.csv")
    val irisLV = iriscsv
      .map { tuple =>
        val list = tuple.productIterator.toList
        val numList = list.map(_.asInstanceOf[String].toDouble)
        LabeledVector(numList(4), DenseVector(numList.take(4).toArray))
      }

    //  irisLV.print
    // val trainTestData = Splitter.trainTestSplit(irisLV)
    val trainTestData = Splitter.trainTestSplit(irisLV, .6, true)
    val trainingData: DataSet[LabeledVector] = trainTestData.training

    val testingData: DataSet[Vector] = trainTestData.testing.map(lv => lv.vector)

    testingData.print()

    val mlr = MultipleLinearRegression()
      .setStepsize(1.0)
      .setIterations(5)
      .setConvergenceThreshold(0.001)

    mlr.fit(trainingData)


    // The fitted model can now be used to make predictions
    val predictions = mlr.predict(testingData)

    predictions.print()
    //然后就没了..没有找到模型评估函数
  }
}

部分预测结果打印

(DenseVector(5.1, 3.7, 1.5, 0.4),6.591026288954913E7)
(DenseVector(6.1, 2.8, 4.7, 1.2),8.923309068156719E7)
(DenseVector(5.5, 4.2, 1.4, 0.2),7.02162215995609E7)
(DenseVector(5.8, 2.6, 4.0, 1.2),8.219699757702138E7)
(DenseVector(5.8, 2.8, 5.1, 2.4),9.11540642945152E7)
(DenseVector(5.0, 3.5, 1.3, 0.3),6.303406202577288E7)
(DenseVector(6.8, 3.2, 5.9, 2.3),1.0496233191523147E8)
(DenseVector(5.1, 3.3, 1.7, 0.5),6.557530949933831E7)
(DenseVector(5.2, 4.1, 1.5, 0.1),6.778513200166823E7)

没有评论

发表评论

邮箱地址不会被公开。 必填项已用*标注