How to Avoid Leakage
The message here is not about how to win predictive modeling competitions.
The reality is that as a data scientist, you're at risk of producing
a data leakage situation any time you prepare or clean your data,
impute missing values, remove outliers, and so on. You might be distorting
the data in the process of preparing it to the point that you'll build a
model that works well on your “clean” dataset, but will totally suck
when applied in the real-world situation where you actually want to
apply it. Claudia gave us some very specific advice to avoid leakage.
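One common way preparation distorts data is imputing missing values with statistics computed over the whole dataset, including the rows you later evaluate on. The sketch below is a minimal illustration, not Claudia's method: the function names and the toy values are made up for the example, and it assumes missing entries are represented as None. The fill value is learned from the training rows alone and then reused on the held-out rows.

```python
from statistics import mean

def fit_mean_imputer(train_values):
    """Learn the fill value from the training values only (None = missing)."""
    observed = [v for v in train_values if v is not None]
    return mean(observed)

def apply_imputer(values, fill_value):
    """Replace missing entries with a fill value learned elsewhere."""
    return [fill_value if v is None else v for v in values]

# Toy data, invented for illustration.
train = [1.0, None, 3.0]
test = [None, 100.0]

# The held-out value 100.0 never influences the fill value.
fill = fit_mean_imputer(train)
train_clean = apply_imputer(train, fill)
test_clean = apply_imputer(test, fill)
```

Had the mean been computed over train and test together, the held-out value 100.0 would have leaked into the training data through the imputed entry.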
First, you need a strict temporal cutoff: remove all information that was
not available just prior to the event of interest. For example, keep only
the stuff you know before a patient is admitted. There has to be a timestamp
on every entry that corresponds to the time you learned that information,
not the time it occurred. Second, removing columns and rows from your data
is asking for trouble, specifically in the form of inconsistencies that can
be teased out. The best practice is to start from scratch with clean, raw
data after careful consideration. Finally, you need to know how the data was
created!
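The temporal cutoff can be sketched in a few lines. This is a rough illustration under stated assumptions: the record layout, the field names (learned_at, occurred_at), and the patient data are all hypothetical. The filter uses the time each value was learned, so a lab test that was run before admission but reported afterward is excluded.

```python
from datetime import datetime

# Hypothetical records: every entry carries both the time the event
# occurred and the time the information was learned.
records = [
    {"field": "age", "value": 54,
     "occurred_at": datetime(2012, 3, 1, 9, 0),
     "learned_at": datetime(2012, 3, 1, 9, 0)},
    {"field": "blood_pressure", "value": 130,
     "occurred_at": datetime(2012, 3, 1, 9, 30),
     "learned_at": datetime(2012, 3, 1, 9, 30)},
    # Sample taken before admission, result reported after it.
    {"field": "lab_result", "value": 7.2,
     "occurred_at": datetime(2012, 3, 1, 8, 0),
     "learned_at": datetime(2012, 3, 1, 12, 0)},
]

# The event of interest (hypothetical admission time).
admitted_at = datetime(2012, 3, 1, 10, 0)

def known_before(entries, cutoff):
    """Keep only entries that were already known strictly before the cutoff."""
    return [e for e in entries if e["learned_at"] < cutoff]

usable = known_before(records, admitted_at)
usable_fields = [e["field"] for e in usable]
```

Filtering on occurred_at instead would keep the lab result, even though nobody knew it at admission time.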
In the paper referenced earlier, Claudia and her coauthors suggest a
methodology for avoiding leakage as a two-stage process: tagging every
observation with legitimacy tags during collection, and then observing
what they call a learn-predict separation.
Evaluating Models
How do you know that your model is any good? We've gone through
this already in some previous chapters, but it's always good to hear this
again from a master.
With powerful algorithms searching for patterns of models, there is a
serious danger of overfitting. It's a difficult concept, but the general
idea is that “if you look hard enough, you'll find something,” even if it
does not generalize beyond the particular training data.
To avoid overfitting, we cross-validate and cut down on the complexity
of the model to begin with. Here's a standard picture in Figure 13-3
(although keep in mind we generally work in high dimensional space
and don't have a pretty picture to look at).
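To make the cross-validation idea concrete, here is a small sketch using only the standard library. The model is an assumption chosen for brevity, not something from the talk: a one-dimensional k-nearest-neighbors regressor, where fewer neighbors means a more complex (wigglier) model. The data is synthetic, and all function names are made up for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, n_neighbors):
    """Average the y values of the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)

def cv_error(data, n_neighbors, k=5):
    """Mean squared error averaged over k held-out folds."""
    errors = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train = [p for i, p in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        mse = sum((knn_predict(train, x, n_neighbors) - y) ** 2
                  for x, y in test) / len(test)
        errors.append(mse)
    return sum(errors) / len(errors)

# Synthetic data: a straight line plus Gaussian noise.
rng = random.Random(42)
data = [(x / 10, 2 * (x / 10) + rng.gauss(0, 1)) for x in range(50)]

# Compare held-out error across model complexities.
for n_neighbors in (1, 5, 15):
    print(n_neighbors, round(cv_error(data, n_neighbors), 3))
```

With one neighbor, the model reproduces every training point exactly, so its training error is zero. The held-out error from cross-validation is what reveals whether that fit generalizes.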