Beyond MapReduce - Enterprise Data Workflows with Cascading

Figure 6-9. RF model—MDS proximity matrix

The plot shows the principal components of the distance matrix: points that are close together represent data points that are similar to each other. This is one way of showing outliers that haven't been handled well by the model. Again, we know that the “setosa” species clusters tightly, whereas “versicolor” and “virginica” tend to overlap.
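To make the idea of a proximity-based distance concrete, here is a minimal sketch in Python rather than R. It assumes the usual random forest definition of proximity (the fraction of trees in which two samples land in the same terminal node) and takes distance as one minus proximity, which is the quantity an MDS plot would then project. The leaf assignments in `LEAVES` are hypothetical and are not taken from the iris model.

```python
from itertools import combinations


def proximity_matrix(leaf_ids):
    """Build a proximity matrix from per-tree leaf assignments.

    leaf_ids[t][i] is the terminal node that tree t assigns to sample i.
    Proximity of (i, j) is the fraction of trees where both share a leaf.
    """
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    prox = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        prox[i][i] = 1.0  # a sample always shares a leaf with itself
    for i, j in combinations(range(n_samples), 2):
        shared = sum(1 for tree in leaf_ids if tree[i] == tree[j])
        prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def distance_matrix(prox):
    """Convert proximities to distances: similar points end up close together."""
    return [[1.0 - p for p in row] for row in prox]


# Hypothetical leaf assignments: 4 trees (rows) by 4 samples (columns).
LEAVES = [
    [0, 0, 1, 2],
    [3, 3, 4, 4],
    [1, 1, 2, 2],
    [5, 5, 6, 7],
]

DIST = distance_matrix(proximity_matrix(LEAVES))
```

In this toy input, samples 0 and 1 always share a leaf, so their distance is zero and they would overlap in the plot, while samples that never share a leaf sit at the maximum distance of 1.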

The remainder of the R script writes out the data with an added column that holds the expected results from the model, which we will use in regression testing. Then it writes the PMML file to capture the model. Take a look at the resulting XML definitions in the data/iris.rf.xml file:

<MiningModel
 modelName="randomForest_Model"
 functionName="classification"
 >
 <MiningSchema>
 ...
 </MiningSchema>
 ...
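As a quick way to inspect a PMML model definition like the one above, the following sketch reads the model name, function name, and mining fields with Python's standard library. The embedded `PMML_SNIPPET` is a simplified, hypothetical stand-in and not the contents of data/iris.rf.xml: the field names are invented for illustration, and a real PMML file declares an XML namespace that the element lookups would need to account for.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical PMML fragment (no namespace, invented field names).
PMML_SNIPPET = """\
<PMML>
  <MiningModel modelName="randomForest_Model" functionName="classification">
    <MiningSchema>
      <MiningField name="species" usageType="predicted"/>
      <MiningField name="sepal_length" usageType="active"/>
    </MiningSchema>
  </MiningModel>
</PMML>
"""


def describe_model(pmml_text):
    """Return (modelName, functionName, {field name: usage type})."""
    root = ET.fromstring(pmml_text)
    model = root.find("MiningModel")
    schema = model.find("MiningSchema")
    # Iterating an Element yields its direct children, here the MiningField tags.
    fields = {field.get("name"): field.get("usageType") for field in schema}
    return model.get("modelName"), model.get("functionName"), fields


NAME, FUNCTION, FIELDS = describe_model(PMML_SNIPPET)
```

A check like this can confirm that the exported file describes a classification model, and which field is the predicted one, before handing it to a scoring engine.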

Now that we have a PMML model, let's use Pattern to run it. We'll run a regression test to confirm that the results predicted on Hadoop match those predicted in R as a baseline. Then we'll calculate a confusion matrix to evaluate the error rates in the model. Again, a log of a successful run is given in a GitHub gist to compare:

$ rm -rf out
$ hadoop jar build/libs/pattern-examples-*.jar \
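Separately from the command above, the two checks described earlier can be sketched in a few lines of Python. This is an illustration of the idea only, not how Pattern implements it: the regression test compares predictions from the Hadoop run with the R baseline row by row, and the confusion matrix counts (actual, predicted) pairs to give a per-species error rate. All labels in `ACTUAL`, `R_PRED`, and `HADOOP_PRED` are hypothetical.

```python
from collections import Counter


def regression_mismatches(baseline, predicted):
    """Return row indices where the new predictions differ from the baseline."""
    return [i for i, (b, p) in enumerate(zip(baseline, predicted)) if b != p]


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def error_rate(matrix, label):
    """Fraction of rows whose actual class is `label` that were misclassified."""
    total = sum(n for (a, _), n in matrix.items() if a == label)
    wrong = sum(n for (a, p), n in matrix.items() if a == label and a != p)
    return wrong / total if total else 0.0


# Hypothetical labels, invented for illustration only.
ACTUAL = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
R_PRED = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]
HADOOP_PRED = list(R_PRED)  # a passing regression test reproduces the baseline

MISMATCHES = regression_mismatches(R_PRED, HADOOP_PRED)
MATRIX = confusion_matrix(ACTUAL, HADOOP_PRED)
```

An empty `MISMATCHES` list means the regression test passes. The two checks answer different questions: the first asks whether the two platforms agree with each other, the second asks how often the model agrees with the known species.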
