We run this script with command-line arguments to specify the number of rows and
columns. For example, the following creates 1,000 rows with 50 independent variables
each:
./examples/py/gen_orders.py 50 1000
A small example is given in the data/sample.tsv file:
label var0 var1 var2 order_id predict
1 0 1 0 6f8e1014 1
0 0 0 1 6f8ea22e 0
1 0 1 0 6f8ea435 1
...
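The source of gen_orders.py is not reproduced here, so the following is only a minimal sketch of how a generator for this kind of file might look. The function names gen_rows and to_tsv, the seed parameter, and the decision to draw label at random are all assumptions made for illustration. The sketch also leaves out the predict column that appears in the sample.

```python
import random


def gen_rows(n_vars, n_rows, seed=None):
    """Yield a header row, then n_rows data rows: label, var0..var{n-1}, order_id.

    Hypothetical stand-in for the real generator script; the rule the real
    script uses to relate label to the variables is not shown in the text,
    so this sketch simply draws every value at random.
    """
    rng = random.Random(seed)
    yield ["label"] + [f"var{i}" for i in range(n_vars)] + ["order_id"]
    for _ in range(n_rows):
        label = rng.randint(0, 1)
        variables = [rng.randint(0, 1) for _ in range(n_vars)]
        # Eight hex digits, similar in shape to the IDs in the sample.
        order_id = f"{rng.getrandbits(32):08x}"
        yield [label] + variables + [order_id]


def to_tsv(rows):
    """Join rows into tab-separated text, one row per line."""
    return "\n".join("\t".join(str(value) for value in row) for row in rows)


if __name__ == "__main__":
    print(to_tsv(gen_rows(n_vars=3, n_rows=5, seed=42)))
```

Passing a fixed seed makes the output repeatable, which helps when the same data must be fed to more than one model.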
Next, we use this data to create a model based on Random Forest, as in the earlier example. The label dependent variable gets predicted based on var0, var1, and var2 as independent variables:
## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220
f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file="sample.rf.xml")
Output from R shows an estimated 14% error rate for this model:
OOB estimate of error rate: 14%
Confusion matrix:
    0   1 class.error
0  69  16   0.1882353
1  12 103   0.1043478
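The figures in this output can be checked by hand. As a short worked example, the following recomputes the per-class error and the overall error rate from the four counts in the confusion matrix, reading rows as actual classes and columns as predicted classes. The helper names class_error and overall_error are ours, not part of R or Pattern.

```python
# Counts copied from the confusion matrix: confusion[actual][predicted]
confusion = {
    0: {0: 69, 1: 16},
    1: {0: 12, 1: 103},
}


def class_error(matrix, actual):
    """Fraction of rows with this actual class that were predicted wrongly."""
    row = matrix[actual]
    wrong = sum(n for predicted, n in row.items() if predicted != actual)
    return wrong / sum(row.values())


def overall_error(matrix):
    """Fraction of all rows whose predicted class differs from the actual one."""
    total = sum(sum(row.values()) for row in matrix.values())
    wrong = sum(
        n
        for actual, row in matrix.items()
        for predicted, n in row.items()
        if predicted != actual
    )
    return wrong / total


print(round(class_error(confusion, 0), 7))  # 0.1882353  (16 of 85)
print(round(class_error(confusion, 1), 7))  # 0.1043478  (12 of 115)
print(overall_error(confusion))             # 0.14       (28 of 200)
```

The 28 misclassified rows out of 200 account for the 14% estimate.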
Next, we use the same data to train a model based on a different algorithm, Logistic Regression. To help illustrate experiment results later, one of the independent variables, var1, is omitted from the model:
## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file="sample.lr.xml")
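To see what omitting var1 means for scoring, here is a minimal sketch of how a logistic regression model of this shape turns a row into a probability. The intercept and coefficients are hypothetical values chosen only for illustration, since the fitted values are not shown in the text. The function name predict_prob is also ours.

```python
import math

# Hypothetical parameters, NOT the values fitted by glm() above.
INTERCEPT = -1.0
COEF = {"var0": 0.5, "var2": 1.5}  # var1 has no term in this model


def predict_prob(row):
    """Logistic function applied to the linear combination of the model's terms."""
    z = INTERCEPT + sum(weight * row[name] for name, weight in COEF.items())
    return 1.0 / (1.0 + math.exp(-z))


# Two rows that differ only in var1
a = {"var0": 0, "var1": 0, "var2": 1}
b = {"var0": 0, "var1": 1, "var2": 1}

print(predict_prob(a) == predict_prob(b))  # True
```

Because var1 has no term in the model, rows that differ only in var1 always receive the same score, whatever the fitted coefficients turn out to be.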
Now we can use the predefined app in Pattern to run both models and collect their
confusion matrix results:
$ rm -rf out
$ hadoop jar build/libs/pattern-examples-*.jar \
  data/sample.tsv out/classify.rf out/trap \
  --pmml sample.rf.xml --measure out/measure
$ mv out/classify.rf .
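Conceptually, a confusion matrix is a tally of (label, predict) pairs. The sketch below is not how Pattern computes its measures, and the format of the files under out/measure is not shown here. It only illustrates the idea, using the three rows listed earlier from data/sample.tsv and assuming the file is tab-separated, as its extension suggests. The function name tally is ours.

```python
import csv
import io
from collections import Counter

# The three rows shown earlier from data/sample.tsv, assumed tab-separated.
SAMPLE = (
    "label\tvar0\tvar1\tvar2\torder_id\tpredict\n"
    "1\t0\t1\t0\t6f8e1014\t1\n"
    "0\t0\t0\t1\t6f8ea22e\t0\n"
    "1\t0\t1\t0\t6f8ea435\t1\n"
)


def tally(tsv_text):
    """Count how often each (label, predict) pair occurs in the TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter((row["label"], row["predict"]) for row in reader)


counts = tally(SAMPLE)
for (label, predict), n in sorted(counts.items()):
    print(label, predict, n)
```

Pairs where label and predict agree fall on the diagonal of the matrix. All three sample rows agree, so the off-diagonal counts are zero here.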