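Next, we use the test DataFrame to measure the accuracy of the model. The test DataFrame is a random 20% split of the original DataFrame that was not used for training.
In the following code, we call transform on the pipeline model. Following the pipeline steps, this passes the test DataFrame through the feature extraction stage, estimates with the random forest model chosen by model tuning, and then returns the predictions in a new DataFrame column.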
// Declared as var because predictions is reassigned later, when the error column is added
var predictions = pipelineModel.transform(testData)
predictions.select("prediction", "medhvalue").show(5)
result:
+------------------+---------+
|        prediction|medhvalue|
+------------------+---------+
|104349.59677450571|  94600.0|
| 77530.43231856065|  85800.0|
|111369.71756877871|  90100.0|
| 97351.87386020401|  82800.0|
+------------------+---------+
With the predictions and labels from the test data, we can now evaluate the model. To evaluate a regression model, you measure how close the prediction values are to the label values. The error in a prediction, shown by the green lines below, is the difference between the prediction (the regression line Y value) and the actual Y value, or label (error = prediction - label).
The Mean Absolute Error (MAE) is the mean of the absolute differences between the label values and the model's predicted values. Taking the absolute value removes any negative signs.
(MAE = sum(absolute(prediction - label)) / number of observations).
The Mean Square Error (MSE) is the sum of the squared errors divided by the number of observations. Squaring removes any negative signs and also gives more weight to larger differences. (MSE = sum(squared(prediction - label)) / number of observations).
The Root Mean Squared Error (RMSE) is the square root of the MSE. RMSE is the standard deviation of the prediction errors (exactly so when the mean error is zero). The error measures how far a labeled data point is from the regression line, and RMSE measures how spread out those errors are.
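To make the three definitions concrete, here is a minimal sketch in plain Scala (no Spark needed) that applies them to the four (prediction, label) pairs shown in the sample output above. The names `pairs`, `sampleMae`, `sampleMse`, and `sampleRmse` are illustrative and not part of the original example.

```scala
// (prediction, label) pairs copied from the first sample output above
val pairs = Seq(
  (104349.59677450571, 94600.0),
  (77530.43231856065, 85800.0),
  (111369.71756877871, 90100.0),
  (97351.87386020401, 82800.0)
)

// Signed error for each row: prediction - label
val errors = pairs.map { case (prediction, label) => prediction - label }
val n = errors.size

// MAE: mean of the absolute errors
val sampleMae = errors.map(e => math.abs(e)).sum / n
// MSE: mean of the squared errors
val sampleMse = errors.map(e => e * e).sum / n
// RMSE: square root of the MSE
val sampleRmse = math.sqrt(sampleMse)

println(f"MAE = $sampleMae%.2f, MSE = $sampleMse%.2f, RMSE = $sampleRmse%.2f")
```

Because squaring weights large errors more heavily, the RMSE of a set of errors is never smaller than its MAE. The two are equal only when every error has the same magnitude.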
The following code example uses the DataFrame withColumn transformation to add a column for the error in prediction: error = prediction - medhvalue. We then display summary statistics for the predicted value, the median house value, and the error (in dollars).
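import org.apache.spark.sql.functions.col

// Add an error column (prediction minus actual value); predictions must be a var to be reassigned
predictions = predictions.withColumn("error",
  col("prediction") - col("medhvalue"))
predictions.select("prediction", "medhvalue", "error").show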
result:
+------------------+---------+-------------------+
|        prediction|medhvalue|              error|
+------------------+---------+-------------------+
| 104349.5967745057|  94600.0|  9749.596774505713|
|  77530.4323185606|  85800.0| -8269.567681439352|
| 101253.3225967887| 103600.0| -2346.677403211302|
+------------------+---------+-------------------+
predictions.describe("prediction", "medhvalue", "error").show
result:
+-------+-----------------+------------------+------------------+
|summary|       prediction|         medhvalue|             error|
+-------+-----------------+------------------+------------------+
|  count|             4161|              4161|              4161|
|   mean|206307.4865123929|205547.72650805095| 759.7600043416329|
| stddev|97133.45817381598|114708.03790345002| 52725.56329678355|
|    min|56471.09903814694|           26900.0|-339450.5381565819|
|    max|499238.1371374392|          500001.0|293793.71945819416|
+-------+-----------------+------------------+------------------+
The following code example uses a Spark RegressionEvaluator to calculate the MAE on the predictions DataFrame, which returns 36636.35 (in dollars).
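import org.apache.spark.ml.evaluation.RegressionEvaluator

// MAE evaluator; the prediction column defaults to "prediction"
val maevaluator = new RegressionEvaluator()
  .setLabelCol("medhvalue")
  .setMetricName("mae")
val mae = maevaluator.evaluate(predictions)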
result:
mae: Double = 36636.35
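As a cross-check on what the evaluator computes, the MAE can also be written as a plain DataFrame aggregation: the average of the absolute value of the error column. The sketch below is an illustration, not the original author's code. It builds a three-row DataFrame from the sample rows shown earlier, and it assumes a local SparkSession is acceptable; in spark-shell, getOrCreate should simply return the existing session. The names `sample` and `maeCheck` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, avg, col}

// Assumption: a local session is fine for this small check
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("mae-check")
  .getOrCreate()
import spark.implicits._

// Three (prediction, medhvalue) rows copied from the sample output above
val sample = Seq(
  (104349.5967745057, 94600.0),
  (77530.4323185606, 85800.0),
  (101253.3225967887, 103600.0)
).toDF("prediction", "medhvalue")
  .withColumn("error", col("prediction") - col("medhvalue"))

// MAE as an aggregation: mean of |error| over all rows
val maeCheck = sample
  .agg(avg(abs(col("error"))).as("mae"))
  .first()
  .getAs[Double]("mae")

println(f"MAE on the three sample rows: $maeCheck%.2f")
```

Running the same aggregation on the full predictions DataFrame should agree with the value returned by the evaluator, up to floating-point rounding.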
The following code example uses a Spark RegressionEvaluator to calculate the RMSE on the predictions DataFrame, which returns 52724.70.
// RMSE evaluator; the prediction column defaults to "prediction"
val evaluator = new RegressionEvaluator()
  .setLabelCol("medhvalue")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
result:
rmse: Double = 52724.70
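The reported RMSE can be sanity-checked against the summary statistics of the error column. The MSE equals the squared mean error plus the population variance of the errors, so the RMSE is close to the standard deviation whenever the mean error is small. The sketch below plugs in the numbers from the describe output. It assumes that the stddev row is the sample standard deviation (n - 1 denominator), and the variable names are illustrative.

```scala
// Summary statistics of the error column, copied from the describe output above
val count = 4161.0
val meanError = 759.7600043416329
// Assumption: describe reports the sample standard deviation (n - 1 denominator)
val stddevError = 52725.56329678355

// Rescale the sample variance to a population variance (n denominator)
val populationVariance = stddevError * stddevError * (count - 1) / count

// MSE = mean(error)^2 + population variance of the errors; RMSE is its square root
val rmseFromSummary = math.sqrt(meanError * meanError + populationVariance)

println(f"RMSE reconstructed from summary statistics: $rmseFromSummary%.2f")
```

Under that assumption, the reconstructed value should land very close to the 52724.70 returned by the evaluator. This is also why the error column's standard deviation (52725.56) and the RMSE are nearly identical here: the mean error of about 760 is small next to the spread of the errors.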