CHAPTER 11
Causality
Many of the models and examples in the book so far have been focused on the fundamental problem of prediction. We've discussed examples like in Chapter 8, where your goal was to build a model to predict whether or not a person would be likely to prefer a certain item, such as a movie or a book. There may be thousands of features that go into the model, and you may use feature selection to narrow those down, but ultimately the model is optimized to achieve the highest accuracy. When you're optimizing for accuracy, you don't necessarily worry about the meaning or interpretation of the features, and especially if there are thousands of features, it's well-nigh impossible to interpret them at all.
Additionally, you wouldn't even want to make the statement that certain characteristics caused the person to buy the item. So, for example, your model for predicting or recommending a book on Amazon could include a feature "whether or not you've read Wes McKinney's O'Reilly book Python for Data Analysis." We wouldn't say that reading his book caused you to read this one. It just might be a good predictor, which would have been discovered and come out as such in the process of optimizing for accuracy. We wish to emphasize here that this is not simply the familiar correlation-causation distinction you've perhaps had drilled into your head already, but rather that your intent when building such a model or system was never to understand causality at all, but rather to predict. And if your intent were to build a model that helps you get at causality, you would go about that in a different way.
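The point that a feature can be a strong predictor without being a cause can be sketched with a small simulation. Everything here is an illustrative assumption, not data from this chapter: a hidden "interest in data science" variable drives both reading McKinney's book and reading this one, so the two behaviors are correlated even though neither causes the other.

```python
import random

random.seed(0)

# Hypothetical confounder: each person's (unobserved) interest in
# data science drives BOTH reading behaviors independently.
n = 10_000
reads_mckinney = []
reads_this_book = []
for _ in range(n):
    interest = random.random()                 # hidden confounder in [0, 1)
    reads_mckinney.append(random.random() < interest)
    reads_this_book.append(random.random() < interest)

# Compare P(reads this book | read McKinney) with
# P(reads this book | did not read McKinney).
n_yes = sum(reads_mckinney)
p_given_yes = sum(b for a, b in zip(reads_mckinney, reads_this_book) if a) / n_yes
p_given_no = sum(b for a, b in zip(reads_mckinney, reads_this_book) if not a) / (n - n_yes)

# The large gap makes "read McKinney" a useful feature for a
# predictive model, even though neither reading caused the other.
print(round(p_given_yes, 2), round(p_given_no, 2))
```

A classifier optimized purely for accuracy would happily pick up this feature; only a causally oriented analysis, which asks what happens if we intervene on the feature, would reveal that it is merely a marker of the hidden interest.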
A whole different set of real-world problems that actually use the same
statistical methods (logistic regression, linear regression) as part of the