Figure 6-9. RF model—MDS proximity matrix
The plot shows the principal components of the distance matrix: points that are close
together represent data points that are similar to each other. This is one way to spot
outliers that the model has not handled well. Again, we know that the “setosa”
species clusters tightly, whereas “versicolor” and “virginica” tend to overlap.
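To make the idea behind the plot concrete, here is a minimal sketch of classical multidimensional scaling in plain Python. It is not the R code used to produce the figure. It assumes a symmetric distance matrix (for a random forest, one common choice is one minus the proximity) whose leading eigenvalues are positive, and the function name classical_mds and the toy distances are made up for illustration.

```python
import math


def classical_mds(dist, k=2, iters=200):
    """Project points into k dimensions from a symmetric distance matrix.

    Sketch only: uses power iteration with deflation, which assumes the
    leading eigenvalues of the centered matrix are positive.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-center the squared distances to get an inner-product matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    def mat_vec(m, v):
        return [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]

    coords = [[0.0] * k for _ in range(n)]
    for axis in range(k):
        # Deterministic, normalized start vector.
        v = [math.sin(i + 1.0 + axis) for i in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iters):
            w = mat_vec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to extract on this axis
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate for v.
        lam = sum(x * y for x, y in zip(v, mat_vec(b, v)))
        if lam > 1e-12:
            scale = math.sqrt(lam)
            for i in range(n):
                coords[i][axis] = v[i] * scale
            # Deflate so the next axis finds the next component.
            for i in range(n):
                for j in range(n):
                    b[i][j] -= lam * v[i] * v[j]
    return coords


# Toy example: three points on a line at positions 0, 1, and 3.
toy = [[0.0, 1.0, 3.0],
       [1.0, 0.0, 2.0],
       [3.0, 2.0, 0.0]]
points = classical_mds(toy, k=1)
```

Points that end up near each other in the returned coordinates are the ones the distance matrix considers similar, which is what the plot visualizes.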
The remainder of the R script writes the data with a column added to represent the
expected results from the model for us to use in regression testing. Then it writes the
PMML file to capture the model. Take a look at the resulting XML definitions in the
data/iris.rf.xml file:
<MiningModel
 modelName="randomForest_Model"
 functionName="classification"
 >
 <MiningSchema>
  <MiningField name="species" usageType="predicted"/>
  <MiningField name="sepal_length" usageType="active"/>
  <MiningField name="sepal_width" usageType="active"/>
  <MiningField name="petal_length" usageType="active"/>
  <MiningField name="petal_width" usageType="active"/>
 </MiningSchema>
...
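As a quick sanity check on a schema like this, we could list which fields the model predicts and which it consumes. The following is a minimal sketch using Python's standard xml.etree.ElementTree. The SNIPPET string is a hand-closed copy of the fragment above, not the full data/iris.rf.xml file, and the helper name mining_fields is hypothetical. A real PMML file normally declares an XML namespace on its root element, so tag lookups against the real file would likely need that namespace.

```python
import xml.etree.ElementTree as ET

# Hand-closed copy of the fragment shown in the text (an assumption: the
# real file contains more elements and typically an XML namespace).
SNIPPET = """
<MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
    <MiningField name="species" usageType="predicted"/>
    <MiningField name="sepal_length" usageType="active"/>
    <MiningField name="sepal_width" usageType="active"/>
    <MiningField name="petal_length" usageType="active"/>
    <MiningField name="petal_width" usageType="active"/>
  </MiningSchema>
</MiningModel>
"""


def mining_fields(xml_text):
    """Group MiningField names by their usageType attribute."""
    root = ET.fromstring(xml_text)
    fields = {}
    for field in root.iter("MiningField"):
        # PMML treats a missing usageType as "active".
        usage = field.get("usageType", "active")
        fields.setdefault(usage, []).append(field.get("name"))
    return fields


fields = mining_fields(SNIPPET)
print("predicted:", fields.get("predicted", []))
print("active:", fields.get("active", []))
```

Here the single predicted field is the label the model emits, and the four active fields are the inputs it expects in each tuple.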
Now that we have a PMML model, let's use Pattern to run it. We'll run a regression test
to confirm that the results predicted on Hadoop match those predicted in R as a baseline.
Then we'll calculate a confusion matrix to evaluate the error rates in the model. Again,
a log of a successful run is given in a GitHub gist to compare:
$ rm -rf out
$ hadoop jar build/libs/pattern-examples-*.jar \
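The two checks described above can be sketched in a few lines of Python. This is not Pattern's implementation, only an illustration of the logic: the regression test compares the baseline prediction against the new prediction row by row, and the confusion matrix counts (actual, predicted) pairs. The function names and the sample labels below are invented for the example and are not results from the iris run.

```python
from collections import Counter


def regression_mismatches(baseline, predicted):
    """Return the row indexes where the new prediction differs from the baseline."""
    return [i for i, (b, p) in enumerate(zip(baseline, predicted)) if b != p]


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def error_rate(matrix):
    """Fraction of rows whose predicted label differs from the actual label."""
    total = sum(matrix.values())
    wrong = sum(n for (a, p), n in matrix.items() if a != p)
    return wrong / total if total else 0.0


# Made-up labels for illustration only.
actual = ["setosa", "versicolor", "virginica", "virginica"]
r_baseline = ["setosa", "versicolor", "versicolor", "virginica"]
hadoop_out = ["setosa", "versicolor", "versicolor", "virginica"]

# Regression test: both runs should agree on every row.
assert regression_mismatches(r_baseline, hadoop_out) == []

# Confusion matrix: compare the model's output with the true species.
matrix = confusion_matrix(actual, hadoop_out)
for (a, p), n in sorted(matrix.items()):
    print(a, "->", p, n)
print("error rate:", error_rate(matrix))
```

Keeping the two checks separate matters: the regression test can pass (both platforms agree) even when the model itself misclassifies some rows, which is what the confusion matrix reveals.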
 