Customer Experiments
There has been much interest in leveraging Pattern, Cascading, and Apache Hadoop to run customer experiments at scale. The idea is to generate multiple variants of a predictive model, each exported as PMML, run those models against large-scale customer data on a Hadoop cluster, and then compare the resulting confusion matrices to measure the relative lift among models.
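To make the comparison step concrete, here is a minimal sketch of tallying a confusion matrix and computing lift from it. The helper names (`confusion_matrix`, `lift`) and the sample label vectors are illustrative, not part of Pattern's API; lift here is taken as precision over the base rate of positives.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Tally (actual, predicted) label pairs into TP/FP/FN/TN counts."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[("1", "1")],
        "fp": counts[("0", "1")],
        "fn": counts[("1", "0")],
        "tn": counts[("0", "0")],
    }

def lift(cm):
    """Precision of the model divided by the base rate of positives."""
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    total = sum(cm.values())
    precision = cm["tp"] / predicted_pos
    base_rate = actual_pos / total
    return precision / base_rate

# toy scoring run: "1" = responder, "0" = non-responder
actual  = ["1", "1", "0", "0", "1", "0", "0", "0"]
model_a = ["1", "0", "0", "0", "1", "0", "1", "0"]
cm = confusion_matrix(actual, model_a)
print(lift(cm))  # lift ≈ 1.78
```

Running the same tally for each PMML model variant and ranking by lift is the essence of the experiment described above.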
To show an example, first we need some data to use for an experiment. The code on GitHub includes a Python script that generates sample data sets. Take a look at the examples/py/gen_orders.py file. That script can be used to create a relatively large data set (e.g., terabyte scale) for training and evaluating the PMML models on a Hadoop cluster:
#!/usr/bin/env python
# encoding: utf-8

import random
import sys
import uuid

CUSTOMER_SEGMENTS = (
    [0.2, ["0", random.gauss, 0.25, 0.75, "%0.2f"]],
    [0.8, ["0", random.gauss, 1.5, 0.25, "%0.2f"]],
    [0.9, ["1", random.gauss, 0.6, 0.2, "%0.2f"]],
    [1.0, ["1", random.gauss, 0.75, 0.2, "%0.2f"]],
)


def gen_row(segments, num_col):
    coin_flip = random.random()

    for prob, rand_var in segments:
        if coin_flip <= prob:
            (label, dist, mean, sigma, f) = rand_var
            order_id = str(uuid.uuid1()).split("-")[0]
            d = dist(mean, sigma)
            m = map(lambda x: f % d, range(0, num_col))

            return [label] + m + [order_id]


if __name__ == '__main__':
    num_row = int(sys.argv[1])
    num_col = int(sys.argv[2])

    m = map(lambda x: "v" + str(x), range(0, num_col))
    print "\t".join(["label"] + m + ["order_id"])

    for i in range(0, num_row):
        print "\t".join(gen_row(CUSTOMER_SEGMENTS, num_col))
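Note that the script targets Python 2 (print statements, list-returning map). For readers on a newer interpreter, the core row generator can be sketched in Python 3 as follows; the logic is the same, only the syntax is updated:

```python
import random
import uuid

CUSTOMER_SEGMENTS = (
    [0.2, ["0", random.gauss, 0.25, 0.75, "%0.2f"]],
    [0.8, ["0", random.gauss, 1.5, 0.25, "%0.2f"]],
    [0.9, ["1", random.gauss, 0.6, 0.2, "%0.2f"]],
    [1.0, ["1", random.gauss, 0.75, 0.2, "%0.2f"]],
)

def gen_row(segments, num_col):
    # pick a segment by cumulative probability, then draw one value
    # from its Gaussian and repeat the formatted value across columns
    coin_flip = random.random()
    for prob, (label, dist, mean, sigma, f) in segments:
        if coin_flip <= prob:
            order_id = str(uuid.uuid1()).split("-")[0]
            d = dist(mean, sigma)
            return [label] + [f % d for _ in range(num_col)] + [order_id]

row = gen_row(CUSTOMER_SEGMENTS, 3)
print("\t".join(row))
```

Each row carries a binary label, num_col formatted feature values drawn from the chosen segment's Gaussian, and a short order ID taken from the first block of a UUID.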