Customer Experiments
There has been much interest in leveraging Pattern, Cascading, and Apache Hadoop to
run customer experiments at scale. The idea is to generate multiple variants of a
predictive model, each exported as PMML, then run those models on a Hadoop cluster
against large-scale customer data. Finally, analysis of the confusion matrix results
measures the relative lift among the models.
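To make that comparison step concrete, here is a minimal sketch of how relative lift
might be computed once per-model confusion matrix counts have been collected. The
model names and counts below are hypothetical, and the lift definition used (precision
within the targeted segment divided by the overall positive rate) is one common
convention rather than something prescribed by Pattern:

# hypothetical confusion matrix counts for two model variants;
# in practice these aggregates come from the cluster-side evaluation
MODELS = {
    "variant_a": {"tp": 480, "fp": 120, "fn": 90, "tn": 9310},
    "variant_b": {"tp": 430, "fp": 95, "fn": 140, "tn": 9335},
}

def lift(cm):
    # precision within the predicted-positive (targeted) segment,
    # divided by the overall positive rate in the data set
    total = float(cm["tp"] + cm["fp"] + cm["fn"] + cm["tn"])
    precision = cm["tp"] / float(cm["tp"] + cm["fp"])
    prevalence = (cm["tp"] + cm["fn"]) / total
    return precision / prevalence

baseline = lift(MODELS["variant_b"])

for name in sorted(MODELS):
    l = lift(MODELS[name])
    print("%s: lift %0.2f, relative lift %0.2f" % (name, l, l / baseline))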
To show an example, first we need some data to use for an experiment. The code on
GitHub includes a Python script to generate sample data sets. Take a look at the
examples/py/gen_orders.py file. That script can be used to create a relatively large
data set (e.g., terabyte scale) for training and evaluating the PMML models on a
Hadoop cluster:
#!/usr/bin/env python
# encoding: utf-8

import random
import sys
import uuid

# each segment: [cumulative probability, [class label, distribution,
# mean, sigma, format string]] -- two segments per class label
CUSTOMER_SEGMENTS = (
    [0.2, ["0", random.gauss, 0.25, 0.75, "%0.2f"]],
    [0.8, ["0", random.gauss, 1.5, 0.25, "%0.2f"]],
    [0.9, ["1", random.gauss, 0.6, 0.2, "%0.2f"]],
    [1.0, ["1", random.gauss, 0.75, 0.2, "%0.2f"]]
)

def gen_row(segments, num_col):
    # choose a segment by testing one uniform draw against the
    # cumulative probabilities
    coin_flip = random.random()

    for prob, rand_var in segments:
        if coin_flip <= prob:
            (label, dist, mean, sigma, f) = rand_var
            # short unique ID taken from the first field of a UUID
            order_id = str(uuid.uuid1()).split("-")[0]
            # one Gaussian draw, formatted into each variable column
            d = dist(mean, sigma)
            m = map(lambda x: f % d, range(0, num_col))

            return [label] + m + [order_id]

if __name__ == '__main__':
    num_row = int(sys.argv[1])
    num_col = int(sys.argv[2])

    # TSV header: label, v0..vN, order_id
    m = map(lambda x: "v" + str(x), range(0, num_col))
    print "\t".join(["label"] + m + ["order_id"])

    for i in range(0, num_row):
        print "\t".join(gen_row(CUSTOMER_SEGMENTS, num_col))