We run this script with command-line arguments to specify the number of rows and
columns. For example, the following creates 1,000 rows with 50 independent variables
each:
./examples/py/gen_orders.py 50 1000
A small example is given in the data/sample.tsv file:
label  var0  var1  var2  order_id  predict
1      0     1     0     6f8e1014  1
0      0     0     1     6f8ea22e  0
1      0     1     0     6f8ea435  1
...
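The R code that follows expects this file to be loaded into a data frame named data. A minimal sketch of that setup is shown below, assuming the file keeps its header row and tab delimiters; the exact loading code in the full example script may differ:

## load the required packages and the generated sample data (sketch)
library(randomForest)   # randomForest()
library(pmml)           # pmml() model export
library(XML)            # saveXML()

data <- read.table("data/sample.tsv", header=TRUE, sep="\t")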
Next, we use this data to create a model based on Random Forest, as in the earlier
example. The dependent variable label is predicted from the independent variables
var0, var1, and var2:
## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220
f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file="sample.rf.xml")
Output from R shows an estimated 14% error rate for this model:
OOB estimate of error rate: 14%
Confusion matrix:
   0   1 class.error
0 69  16   0.1882353
1 12 103   0.1043478
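The reported OOB rate matches the counts in the matrix: of the 200 observations summarized there, 16 + 12 = 28 were misclassified out-of-bag, and 28 / 200 = 14%.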
Next, we use the same data to train a model based on a different algorithm, Logistic
Regression. To help illustrate the experiment results later, one of the independent
variables, var1, is omitted from the model:
## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file="sample.lr.xml")
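Before handing both PMML files to Pattern, it can be useful to sanity-check the weaker model directly in R. The following is a quick sketch rather than part of the example script; the 0.5 cutoff and the in-sample evaluation are assumptions for illustration:

## in-sample check of the Logistic Regression fit (sketch)
## threshold the predicted probabilities at 0.5 and tabulate against the labels
pred <- ifelse(predict(fit, newdata=data, type="response") > 0.5, 1, 0)
print(table(actual=data$label, predicted=pred))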
Now we can use the predefined app in Pattern to run both models and collect their
confusion matrix results:
$ rm -rf out
$ hadoop jar build/libs/pattern-examples-*.jar \
data/sample.tsv out/classify.rf out/trap \
--pmml sample.rf.xml --measure out/measure
$ mv out/classify.rf .
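The same command can then be repeated for the Logistic Regression model, pointing --pmml at sample.lr.xml and writing to fresh output directories, which yields the second confusion matrix for comparison.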