Let's create a model in R, export it as PMML, and then run that model on Hadoop. The
following example uses a well-known public domain data set called Iris, which is based
on a 1936 botanical study of three species of Iris flower. Look in data/iris.rf.tsv for an
example of this data:
sepal_length sepal_width petal_length petal_width species predict
5.1 3.5 1.4 0.2 setosa setosa
4.9 3.0 1.4 0.2 setosa setosa
5.6 2.5 3.9 1.1 versicolor versicolor
5.9 3.2 4.8 1.8 versicolor virginica
6.3 3.3 6.0 2.5 virginica virginica
4.9 2.5 4.5 1.7 virginica versicolor
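As a quick way to read the sample above, the following sketch loads the same six rows and counts how often the predict column agrees with the species column. The rows are inlined so the snippet runs on its own; the name sample_text is hypothetical, and reading the real file with read.delim("data/iris.rf.tsv") would assume it is tab-separated with a header row.

```r
# The six sample rows shown above, inlined so the snippet is self-contained.
sample_text <- "sepal_length sepal_width petal_length petal_width species predict
5.1 3.5 1.4 0.2 setosa setosa
4.9 3.0 1.4 0.2 setosa setosa
5.6 2.5 3.9 1.1 versicolor versicolor
5.9 3.2 4.8 1.8 versicolor virginica
6.3 3.3 6.0 2.5 virginica virginica
4.9 2.5 4.5 1.7 virginica versicolor"

rows <- read.table(text = sample_text, header = TRUE, stringsAsFactors = FALSE)

# TRUE where the predicted species equals the labeled species
matches <- rows$species == rows$predict
print(rows[!matches, ])   # the rows the model got wrong
print(sum(matches))       # number of correct predictions in this sample
```

In this small sample the mismatches are all between versicolor and virginica, which are the edge cases discussed below.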
Next, we'll create a predictive model using a machine learning algorithm called Random
Forest (RF). Random Forest is an ensemble learning method that applies a statistical
technique called “bagging” (bootstrap aggregating) to decision trees. The general idea is
that a single decision tree is rarely enough to capture the possible variations in a large
data set. Instead, we create a collection of decision trees to help explain the various
edge cases while avoiding overfitting.
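To make the bagging idea concrete, here is a minimal sketch that builds the ensemble by hand: resample the training rows with replacement, fit one decision tree per resample, and combine the trees by majority vote. It uses the rpart package for the individual trees, which is an assumption of this sketch and not what the randomForest package does internally; the seed and the ensemble size n_trees are hypothetical values. A real Random Forest also picks a random subset of features at each split, which this sketch leaves out.

```r
library(rpart)  # decision trees; a recommended package shipped with standard R installs

set.seed(42)    # hypothetical seed, only so repeated runs give the same split
data(iris)
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test <- iris[-idx, ]

n_trees <- 25   # hypothetical ensemble size

# Fit one tree per bootstrap resample of the training rows.
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  rpart(Species ~ ., data = boot, method = "class")
})

# Each column holds one tree's predicted class labels for the test rows.
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, newdata = test, type = "class"))
})

# Majority vote across the trees for each test row.
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(bagged == as.character(test$Species))
print(accuracy)
```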
In this example, the RF model uses flower measurements such as petal length to predict
the iris species. The Iris data set is particularly interesting in statistics because it is
provably impossible to predict all the edge cases correctly using simple linear regression
methods. That presents an excellent use case for RF. The algorithm is widely used for
this reason in domains that have lots of important edge cases: for example, in finance
for anti-fraud detection, and in astrophysics for detecting cosmological anomalies.
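One way to see where those edge cases come from is to compare the range of a single measurement across the three species. The sketch below uses R's built-in iris data and looks only at petal length; overlap on one measurement does not by itself prove the claim above, it only illustrates why the boundary between versicolor and virginica is hard to draw.

```r
data(iris)

# Minimum and maximum petal length per species (one column per species).
ranges <- sapply(split(iris$Petal.Length, iris$Species), range)
rownames(ranges) <- c("min", "max")
print(ranges)

# setosa sits entirely below the other two species on this measurement...
setosa_separable <- ranges["max", "setosa"] < ranges["min", "versicolor"]

# ...while the versicolor and virginica ranges overlap.
overlap <- ranges["max", "versicolor"] > ranges["min", "virginica"]
print(c(setosa_separable = setosa_separable, overlap = overlap))
```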
Take a look at the source code in examples/r/pmml_models.R, in particular the section
that handles RF modeling. Here is an R script for just that model, based on the Random
Forest implementation in R:
## one-time setup: install the required packages
install.packages("pmml")
install.packages("randomForest")
library(pmml)
library(randomForest)
require(graphics)
## split data into test and train sets
data(iris)
iris_full <- iris
colnames(iris_full) <-
  c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
idx <- sample(150, 100)  # 100 random rows for training, the other 50 for testing
iris_train <- iris_full[idx, ]
iris_test <- iris_full[-idx, ]
## train a Random Forest model
## species is already a factor, so name it directly; that keeps it out of the "." predictors
f <- as.formula("species ~ .")
fit <- randomForest(f, data = iris_train, proximity = TRUE, ntree = 50)
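Before exporting a model, it is usually worth checking it against the held-out rows. The following is a minimal sketch of one way to do that, not necessarily how examples/r/pmml_models.R does it: it repeats the split and training steps so that it runs on its own, then adds a predict column similar to the one in the sample data. It assumes the randomForest package is already installed; the seed is a hypothetical value added for repeatability.

```r
library(randomForest)

set.seed(1)  # hypothetical seed, not part of the original script
data(iris)
iris_full <- iris
colnames(iris_full) <-
  c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
idx <- sample(150, 100)
iris_train <- iris_full[idx, ]
iris_test <- iris_full[-idx, ]

fit <- randomForest(species ~ ., data = iris_train, proximity = TRUE, ntree = 50)

# Out-of-bag confusion matrix computed from the training rows.
print(fit$confusion)

# Predict the held-out rows and compare against the true labels.
iris_test$predict <- predict(fit, newdata = iris_test)
print(table(actual = iris_test$species, predicted = iris_test$predict))
test_accuracy <- mean(iris_test$predict == iris_test$species)
print(test_accuracy)
```

Any off-diagonal counts in the table correspond to rows where the species and predict columns disagree.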