Beyond MapReduce - Enterprise Data Workflows with Cascading

We run this script with command-line arguments to specify the number of rows and
columns. For example, the following creates 1,000 rows with 50 independent variables
each:

./examples/py/gen_orders.py 50 1000

A small example is given in the data/sample.tsv file:

label  var0  var1  var2  order_id  predict
1      0     1     0     6f8e1014  1
0      0     0     1     6f8ea22e  0
1      0     1     0     6f8ea435  1
...
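To make the column layout concrete, here is a minimal Python sketch that parses the three sample rows shown above and separates the independent variables from the label and predict columns. It assumes the file is tab-delimited with a header row, as the .tsv extension suggests. The names SAMPLE_TSV and parse_rows are illustrative only and are not part of the Pattern project.

```python
import csv
import io

# The three rows shown above, written out as tab-separated text.
# Assumption: data/sample.tsv is tab-delimited with a header row.
SAMPLE_TSV = (
    "label\tvar0\tvar1\tvar2\torder_id\tpredict\n"
    "1\t0\t1\t0\t6f8e1014\t1\n"
    "0\t0\t0\t1\t6f8ea22e\t0\n"
    "1\t0\t1\t0\t6f8ea435\t1\n"
)


def parse_rows(text):
    """Return one dict per data row with integer features, label, and predict."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for rec in reader:
        rows.append({
            "order_id": rec["order_id"],
            "label": int(rec["label"]),
            "predict": int(rec["predict"]),
            # Every column named var<N> is treated as an independent variable.
            "features": [int(v) for k, v in rec.items() if k.startswith("var")],
        })
    return rows


rows = parse_rows(SAMPLE_TSV)
for row in rows:
    print(row["order_id"], row["features"], row["label"], row["predict"])
```

Keying on the var prefix means the same sketch would pick up all 50 generated variables without listing them by name.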

Next, we use this data to create a model based on Random Forest, as in the earlier
example. The dependent variable label is predicted from the independent variables
var0, var1, and var2:

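## packages providing randomForest(), pmml(), and saveXML()
library(randomForest)
library(pmml)
library(XML)

## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220
## note: `data` is the data frame holding the sample data
f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file="sample.rf.xml")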

Output from R shows an estimated 14% error rate for this model:

OOB estimate of error rate: 14%
Confusion matrix:
   0   1 class.error
0 69  16   0.1882353
1 12 103   0.1043478
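The error figures in this output follow directly from the four counts in the confusion matrix. The short Python sketch below reproduces the arithmetic, reading rows as actual classes and columns as predicted classes, which is how randomForest reports its confusion matrix. The function names are illustrative.

```python
# Confusion matrix from the R output above:
# outer key = actual class, inner key = predicted class.
confusion = {
    0: {0: 69, 1: 16},
    1: {0: 12, 1: 103},
}


def class_error(matrix, actual):
    """Fraction of rows of one actual class that were predicted wrongly."""
    row = matrix[actual]
    wrong = sum(n for predicted, n in row.items() if predicted != actual)
    return wrong / sum(row.values())


def overall_error(matrix):
    """Fraction of all rows that were predicted wrongly."""
    total = sum(sum(row.values()) for row in matrix.values())
    wrong = sum(
        n
        for actual, row in matrix.items()
        for predicted, n in row.items()
        if predicted != actual
    )
    return wrong / total


print(round(class_error(confusion, 0), 7))  # 16 / 85  -> 0.1882353
print(round(class_error(confusion, 1), 7))  # 12 / 115 -> 0.1043478
print(overall_error(confusion))             # 28 / 200 -> 0.14
```

The 28 misclassified rows out of 200 give the 14% estimate, and each class.error value is the off-diagonal count divided by its row total.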

Next, we use the same data to train a model based on a different algorithm, Logistic
Regression. To help illustrate experiment results later, one of the independent
variables, var1, is omitted from the model:

## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file="sample.lr.xml")
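For intuition about what the fitted model computes, a binomial GLM with the default logit link turns a linear combination of var0 and var2 into a probability that label is 1. The Python sketch below shows that scoring step only. The coefficient values are hypothetical placeholders, not values produced by the R script above, and the 0.5 threshold is an assumption as well.

```python
import math


def logistic(z):
    """Inverse of the logit link: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(var0, var2, intercept, b_var0, b_var2, threshold=0.5):
    """Score one row; return (predicted label, probability that label is 1)."""
    p = logistic(intercept + b_var0 * var0 + b_var2 * var2)
    return (1 if p >= threshold else 0), p


# Hypothetical coefficients, chosen only to exercise the function.
# Real values would come from print(summary(fit)) in R.
INTERCEPT, B_VAR0, B_VAR2 = 0.5, -1.0, -2.0

for var0, var2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    label, p = predict_label(var0, var2, INTERCEPT, B_VAR0, B_VAR2)
    print(var0, var2, label, round(p, 3))
```

Because var1 is left out of the formula, two rows that differ only in var1 receive the same score from this model.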

Now we can use the predefined app in Pattern to run both models and collect their
confusion matrix results:

$ rm -rf out
$ hadoop jar build/libs/pattern-examples-*.jar \
    data/sample.tsv out/classify.rf out/trap \
    --pmml sample.rf.xml --measure out/measure
$ mv out/classify.rf .
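As an illustration of what collecting a confusion matrix involves, the Python sketch below tallies (label, predict) pairs into counts. It uses the three sample rows shown earlier as input. The format that Pattern actually writes to out/measure is not shown here, so the output layout and the name confusion_counts are assumptions made for this sketch.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count how often each (actual label, predicted label) pair occurs."""
    return Counter(pairs)


# (label, predict) pairs taken from the three sample rows shown earlier.
pairs = [(1, 1), (0, 0), (1, 1)]

counts = confusion_counts(pairs)
for (actual, predicted), n in sorted(counts.items()):
    print(f"actual={actual} predicted={predicted} count={n}")
```

Pairs where the two values differ are the misclassifications. None occur in these three rows, so only the diagonal cells are populated.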
