## report the measures of model fitness
print(fit$importance)
print(fit)
print(table(iris_test$species, predict(fit, iris_test, type = "class")))

## visualize results
plot(fit, log = "y", main = "Random Forest")
varImpPlot(fit)
MDSplot(fit, iris_full$species)

## export PMML + test data
out <- iris_full
out$predict <- predict(fit, out, type = "class")
dat_folder <- './data'
tsv <- paste(dat_folder, "iris.rf.tsv", sep = "/")
write.table(out, file = tsv, quote = FALSE, sep = "\t", row.names = FALSE)
saveXML(pmml(fit), file = paste(dat_folder, "iris.rf.xml", sep = "/"))
The R script loads the required packages, along with the Iris data set. It splits the data
set into two parts, iris_train and iris_test, then trains a Random Forest model on the
iris_train part, using the petal and sepal measures to predict species.
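Those loading, splitting, and training steps precede the snippet above. A minimal sketch of what they might look like: the randomForest, pmml, and XML packages are implied by the calls above, and the lowercase column names are implied by iris_test$species, but the exact names, the 100/50 split, and the ntree and proximity settings here are illustrative assumptions rather than the original script.
## load required packages; saveXML comes from the XML package
library(randomForest)
library(pmml)
library(XML)

## rename the Iris columns (assumed names) and split into train/test sets
data(iris)
iris_full <- iris
colnames(iris_full) <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
idx <- sample(nrow(iris_full), 100)
iris_train <- iris_full[idx, ]
iris_test <- iris_full[-idx, ]

## train a Random Forest classifier; proximity = TRUE supplies the
## proximity matrix that MDSplot uses later
fit <- randomForest(species ~ ., data = iris_train, proximity = TRUE, ntree = 50)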
The results of this model creation get evaluated and visualized in a few different ways.
First we have a few printed reports about the fitness of the model. One well-known
aspect of the Iris data set is that the “setosa” species is relatively easy to predict, whereas
the other two species have overlap, which confuses predictive models. We see that in
the results, but overall there is an estimated 5% error rate:
OOB estimate of error rate: 5%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         32          0         0  0.00000000
versicolor      0         26         2  0.07142857
virginica       0          3        37  0.07500000
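The same OOB confusion matrix printed above is also stored on the fitted model, so the overall error rate can be recomputed directly; a small sketch using the randomForest object's confusion component:
## OOB confusion matrix, including the per-class error column
fit$confusion

## overall OOB error rate (about 0.05 for the run shown above)
cm <- fit$confusion[, -ncol(fit$confusion)]  ## drop the class.error column
1 - sum(diag(cm)) / sum(cm)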
The chart in Figure 6-7 shows error rate versus the number of trees. One of the parameters
for training an RF model is the number of trees in the forest. As that
parameter approaches 50 trees, the decrease in error levels out.
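The values behind that chart are kept in the fitted model's err.rate component, so the leveling-out point can also be checked numerically; a brief sketch:
## err.rate holds cumulative OOB and per-class error rates, one row per tree
dim(fit$err.rate)

## OOB error after the final tree, and the fewest trees that reach the minimum
tail(fit$err.rate[, "OOB"], 1)
which.min(fit$err.rate[, "OOB"])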