$ rm -rf out
$ hadoop jar build/libs/pattern-examples-*.jar \
data/sample.tsv out/classify.lr out/trap \
--pmml sample.lr.xml --measure out/measure
$ mv out/classify.lr .
It would be reasonably simple to build a Cascading app to do the comparisons between
models, i.e., a framework for customer experiments. That would be especially useful if
there were a large number of models to compare. In this case, we can compare results
using a spreadsheet as shown in Figure 6-10.
Figure 6-10. Customer experiment
The model based on Logistic Regression has a lower false negative (FN) rate: 5% versus
11%. However, that model has a much higher false positive (FP) rate: 52% versus 14%.
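To make the arithmetic behind those percentages explicit, here is a small sketch. The confusion-matrix counts below are hypothetical, chosen only so that the resulting rates match the percentages quoted above:

```python
# Hypothetical confusion-matrix counts; only the derived FN and FP rates
# correspond to the percentages cited in the text.
def rates(tp, fp, fn, tn):
    fn_rate = fn / (fn + tp)   # false negative rate: frauds that slip through
    fp_rate = fp / (fp + tn)   # false positive rate: legit orders flagged
    return fn_rate, fp_rate

lr_fn, lr_fp = rates(tp=95, fp=52, fn=5, tn=48)    # Logistic Regression
rf_fn, rf_fp = rates(tp=89, fp=14, fn=11, tn=86)   # Random Forest

print(f"LR: FN={lr_fn:.0%}, FP={lr_fp:.0%}")   # LR: FN=5%, FP=52%
print(f"RF: FN={rf_fn:.0%}, FP={rf_fp:.0%}")   # RF: FN=11%, FP=14%
```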
Let's put this into terms that decision makers use in business to determine which model
is better. For example, in the case of an anti-fraud classifier used in ecommerce, we can
assign a cost function to select a winner of the experiment. On one hand, a higher rate
of false negatives implies that more fraudulent orders fail to get flagged for review.
Ultimately that results in a higher rate of chargeback fines from the bank, and punitive
actions by the credit card processor if that rate goes too high for too long. So the FN
rate is proportional to chargeback risk in ecommerce. On the other hand, a higher rate
of false positives implies that more legitimate orders get flagged for review. Ultimately
that results in more complaints from actual customers, and higher costs for customer
support. So the FP rate is proportional to support costs in ecommerce.
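One way to turn those two risks into a single number is a simple expected-cost function. The unit costs below are hypothetical placeholders (real values would come from the business); the point of the sketch is that which model wins depends on the ratio of chargeback cost to review cost:

```python
# Hypothetical per-order unit costs; real values come from the business.
CHARGEBACK_COST = 25.0   # chargeback fine plus lost goods per missed fraud (FN)
REVIEW_COST = 2.0        # customer support cost per legitimate order flagged (FP)

def expected_cost(fn_rate, fp_rate,
                  chargeback=CHARGEBACK_COST, review=REVIEW_COST):
    """Rough per-order cost, weighting each error rate by its price.
    (A fuller treatment would also weight by the fraud base rate.)"""
    return fn_rate * chargeback + fp_rate * review

# Rates quoted in the text
lr = expected_cost(0.05, 0.52)   # Logistic Regression
rf = expected_cost(0.11, 0.14)   # Random Forest
print(f"LR: {lr:.2f}  RF: {rf:.2f}")   # LR: 2.29  RF: 3.03 -- LR wins here

# Raise the review (support) cost and the winner flips to Random Forest
print(expected_cost(0.05, 0.52, review=10.0) >
      expected_cost(0.11, 0.14, review=10.0))   # True
```

At a $25 chargeback versus $2 review cost, the Logistic Regression model's low FN rate wins; make reviews expensive enough and the Random Forest model's low FP rate wins instead.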
Evaluating this experiment, the Logistic Regression model (which had a variable omitted
to exaggerate the comparison) resulted in approximately half the FN rate, compared
with the Random Forest model. However, it also resulted in quadrupled costs for customer
support. A decision maker can use those cost trade-offs to select the appropriate
model for the business needs.
 