Beyond MapReduce - Enterprise Data Workflows with Cascading

Let's create a model in R, then export it as PMML, and run that model on Hadoop. The
following example uses a well-known public domain data set called Iris, which is based
on a 1936 botanical study of three species of Iris flower. Look in data/iris.rf.tsv for an
example of this data:

sepal_length  sepal_width  petal_length  petal_width  species     predict
5.1           3.5          1.4           0.2          setosa      setosa
4.9           3.0          1.4           0.2          setosa      setosa
5.6           2.5          3.9           1.1          versicolor  versicolor
5.9           3.2          4.8           1.8          versicolor  virginica
6.3           3.3          6.0           2.5          virginica   virginica
4.9           2.5          4.5           1.7          virginica   versicolor
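
To see how the species and predict columns relate, here is a minimal sketch in R that cross-tabulates the six sample rows shown above. The rows are inlined as text so that the snippet runs without the data file; the variable names are ours, chosen for illustration. These six rows were picked to show both hits and misses, so the resulting fraction says nothing about the model's overall accuracy.

```r
# The six sample rows shown above, inlined so the sketch runs without the file.
rows <- "sepal_length sepal_width petal_length petal_width species predict
5.1 3.5 1.4 0.2 setosa setosa
4.9 3.0 1.4 0.2 setosa setosa
5.6 2.5 3.9 1.1 versicolor versicolor
5.9 3.2 4.8 1.8 versicolor virginica
6.3 3.3 6.0 2.5 virginica virginica
4.9 2.5 4.5 1.7 virginica versicolor"

iris_sample <- read.table(text = rows, header = TRUE, stringsAsFactors = FALSE)

# Cross-tabulate the labeled species against the model's prediction.
confusion <- table(actual = iris_sample$species,
                   predicted = iris_sample$predict)
print(confusion)

# Fraction of these sample rows where the prediction matches the label.
accuracy <- mean(iris_sample$species == iris_sample$predict)
print(accuracy)
```

The off-diagonal cells of the table are the misclassified rows, which in this sample all involve versicolor and virginica.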

Next, we'll create a predictive model using a machine learning algorithm called Random
Forest (RF). Random Forest is an ensemble learning method that applies a statistical
technique called “bagging” (bootstrap aggregating) to decision trees. The general idea is
that a single decision tree is rarely enough to capture the possible variations in a large
data set. Instead, we create a collection of decision trees to help explain the various
edge cases while avoiding overfitting.
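
The bagging idea can be sketched in a few lines of base R. This is an illustration only, not the Random Forest algorithm: it uses deliberately weak one-split trees ("stumps") written by hand, whereas a real Random Forest grows full trees and also randomizes the features considered at each split. The helper names fit_stump and predict_stump, the seed, and the count of 25 models are all assumptions made for this sketch.

```r
# Illustrative bagging sketch in base R; not the randomForest algorithm itself.
set.seed(42)  # arbitrary seed so the sketch is repeatable

data(iris)
features <- names(iris)[1:4]

# Fit a one-split "stump": pick the feature/threshold pair that misclassifies
# the fewest training rows when each side predicts its majority class.
fit_stump <- function(df) {
  best <- NULL
  for (feat in features) {
    for (thr in unique(df[[feat]])) {
      left <- df$Species[df[[feat]] <= thr]
      right <- df$Species[df[[feat]] > thr]
      if (length(left) == 0 || length(right) == 0) next
      errors <- (length(left) - max(table(left))) +
        (length(right) - max(table(right)))
      if (is.null(best) || errors < best$errors) {
        best <- list(feature = feat, threshold = thr, errors = errors,
                     left = names(which.max(table(left))),
                     right = names(which.max(table(right))))
      }
    }
  }
  best
}

# Apply a fitted stump to new rows, returning one label per row.
predict_stump <- function(stump, df) {
  ifelse(df[[stump$feature]] <= stump$threshold, stump$left, stump$right)
}

# Bagging: train each stump on a bootstrap sample (rows drawn with replacement).
n_models <- 25
stumps <- lapply(seq_len(n_models), function(i) {
  fit_stump(iris[sample(nrow(iris), replace = TRUE), ])
})

# Aggregate: every stump votes, and the majority label wins for each row.
votes <- sapply(stumps, predict_stump, df = iris)  # rows x models matrix
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

accuracy <- mean(bagged == iris$Species)
print(accuracy)
```

Because a stump can only ever emit two labels, the ensemble here stays weak; the point is the mechanism of resampling and voting, which carries over when the stumps are replaced by full decision trees.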

In this example, the RF model uses flower measurements such as petal length to predict
the iris species. The Iris data set is particularly interesting in statistics because it is
provably impossible to predict all the edge cases correctly using simple linear regression
methods. That presents an excellent use case for RF. The algorithm is widely used for
this reason in domains that have lots of important edge cases: for example, in finance
for fraud detection, and in astrophysics for detecting cosmological anomalies.
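
A quick look at one measurement hints at where those edge cases come from. The sketch below uses the iris data set that ships with R and compares the range of petal length for each species. It is a one-dimensional illustration under our own variable names, not a proof of the claim above.

```r
data(iris)

# Range of petal length for each species (row 1: minimum, row 2: maximum).
petal_ranges <- sapply(split(iris$Petal.Length, iris$Species), range)
print(petal_ranges)

# Setosa is cleanly separated from versicolor by a single threshold...
setosa_separable <- petal_ranges[2, "setosa"] < petal_ranges[1, "versicolor"]

# ...but versicolor and virginica overlap, so no single cut on this
# measurement can label every flower correctly.
overlap <- petal_ranges[2, "versicolor"] > petal_ranges[1, "virginica"]

print(c(setosa_separable = setosa_separable, overlap = overlap))
```

The flowers that fall inside the overlapping interval are the ones a simple threshold gets wrong, and they are the cases an ensemble of trees is meant to handle.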

Take a look at the source code in examples/r/pmml_models.R, in particular the section
that handles RF modeling. Here is an R script for just that model, based on the Random
Forest implementation in R:

install.packages("pmml")
install.packages("randomForest")
library(pmml)
library(randomForest)
require(graphics)

## split data into test and train sets
data(iris)
iris_full <- iris
colnames(iris_full) <-
  c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
idx <- sample(150, 100)
iris_train <- iris_full[idx, ]
iris_test <- iris_full[-idx, ]

## train a Random Forest model
f <- as.formula("as.factor(species) ~ .")
fit <- randomForest(f, data = iris_train, proximity = TRUE, ntree = 50)
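
Once the model is trained, a natural next step is to check it against the held-out rows. The following is a minimal sketch of that check, not necessarily what pmml_models.R does. It rebuilds the same split so that it runs on its own, adds a fixed seed (our assumption, for repeatability only), and skips the modeling step if the randomForest package is not installed.

```r
# Rebuild the same train/test split as the script above.
set.seed(1)  # assumption: a fixed seed, added here only for repeatability
data(iris)
iris_full <- iris
colnames(iris_full) <-
  c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
idx <- sample(150, 100)
iris_train <- iris_full[idx, ]
iris_test <- iris_full[-idx, ]

# Only run the modeling step if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  # species is already a factor, so the formula can use it directly.
  fit <- randomForest::randomForest(species ~ ., data = iris_train,
                                    proximity = TRUE, ntree = 50)

  # Predict the held-out rows and cross-tabulate against the true species.
  predicted <- predict(fit, iris_test)
  print(table(actual = iris_test$species, predicted = predicted))

  # Variable importance as estimated by the forest.
  print(fit$importance)
}
```

Because sample() draws a random split, the exact counts in the confusion matrix vary from run to run unless a seed is set.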
