Flink也支持ML库,但不太成熟:
- flink 比spark 支持的ML算法少很多
- flink中只有dataset 类型的数据才能使用ML,datastream类型数据没有专门的ML库;
- flink 中dateset 不能转换成dataframe结构…特征数据处理感觉不是很方便
- flink dateset ML库中的算法类似乎没有提供模型评估方法
一个简单的线性回归算法实践:
1.数据
数据还是著名的鸢尾花卉数据集,包含了5个属性:
& Sepal.Length(花萼长度),单位是cm;
& Sepal.Width(花萼宽度),单位是cm;
& Petal.Length(花瓣长度),单位是cm;
& Petal.Width(花瓣宽度),单位是cm;
& 种类:Iris Setosa(山鸢尾)、Iris Versicolour(杂色鸢尾),以及Iris Virginica(维吉尼亚鸢尾)
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
5.0,3.6,1.4,0.2,0
2. 算法代码:
使用线性回归算法预测
import org.apache.flink.api.scala._
import org.apache.flink.ml._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.preprocessing.Splitter
import org.apache.flink.ml.regression.MultipleLinearRegression
object Job {
def main(args: Array[String]) {
// set up the execution environment
val env = ExecutionEnvironment.getExecutionEnvironment
val iriscsv = env.readCsvFile[(String, String, String, String, String)]("D:/code/dataML/iris.csv")
val irisLV = iriscsv
.map { tuple =>
val list = tuple.productIterator.toList
val numList = list.map(_.asInstanceOf[String].toDouble)
LabeledVector(numList(4), DenseVector(numList.take(4).toArray))
}
// irisLV.print
// val trainTestData = Splitter.trainTestSplit(irisLV)
val trainTestData = Splitter.trainTestSplit(irisLV, .6, true)
val trainingData: DataSet[LabeledVector] = trainTestData.training
val testingData: DataSet[Vector] = trainTestData.testing.map(lv => lv.vector)
testingData.print()
val mlr = MultipleLinearRegression()
.setStepsize(1.0)
.setIterations(5)
.setConvergenceThreshold(0.001)
mlr.fit(trainingData)
// The fitted model can now be used to make predictions
val predictions = mlr.predict(testingData)
predictions.print()
//然后就没了..没有找到模型评估函数
}
}
部分预测结果打印
(DenseVector(5.1, 3.7, 1.5, 0.4),6.591026288954913E7)
(DenseVector(6.1, 2.8, 4.7, 1.2),8.923309068156719E7)
(DenseVector(5.5, 4.2, 1.4, 0.2),7.02162215995609E7)
(DenseVector(5.8, 2.6, 4.0, 1.2),8.219699757702138E7)
(DenseVector(5.8, 2.8, 5.1, 2.4),9.11540642945152E7)
(DenseVector(5.0, 3.5, 1.3, 0.3),6.303406202577288E7)
(DenseVector(6.8, 3.2, 5.9, 2.3),1.0496233191523147E8)
(DenseVector(5.1, 3.3, 1.7, 0.5),6.557530949933831E7)
(DenseVector(5.2, 4.1, 1.5, 0.1),6.778513200166823E7)
没有评论