This had to do with how the person prepared the data for the competition,
as depicted in Figure 13-2.
Figure 13-2. How data preparation was done for the INFORMS
competition
The diagnosis code for pneumonia was 486. So the preparer removed
that (and replaced it with a “-1”) if it showed up in the record (rows
are different patients; columns are different diagnoses; there is a
maximum of four diagnoses; “-1” means there's nothing for that entry).
Moreover, to avoid telling holes in the data, the preparer moved the
other diagnoses to the left if necessary, so that only “-1”s were on the
right.
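The preparation step described above can be sketched in Python. This is only an illustration of the procedure as the text describes it, not the preparer's actual code: the function name `prepare_row` and the constant names are hypothetical, and any diagnosis code other than 486 is made up for the example.

```python
PNEUMONIA = 486     # diagnosis code for pneumonia, as given in the text
EMPTY = -1          # marker meaning "nothing for that entry"
MAX_DIAGNOSES = 4   # maximum number of diagnosis columns, as given in the text


def prepare_row(diagnoses):
    """Hypothetical reconstruction of the preparation step.

    Drops the pneumonia code, shifts the remaining diagnoses to the
    left, and pads the right-hand side with -1 so no gap is visible.
    """
    kept = [d for d in diagnoses if d != EMPTY and d != PNEUMONIA]
    return kept + [EMPTY] * (MAX_DIAGNOSES - len(kept))
```

A patient whose only diagnosis was pneumonia ends up as a row of four “-1”s, while a patient with four other diagnoses is left unchanged.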
There are two problems with this:
• If the row has only “-1”s, then you know it started out with only
pneumonia.
• If the row has no “-1”s, you know there's no pneumonia (unless
there are actually five diagnoses, but that's less common).
This alone was enough information to win the competition.
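The two observations above amount to a simple rule that reads the answer off the layout of a row, without any modeling. A minimal sketch follows, assuming each row is a list of four entries with “-1” for empty slots; the function name `leaked_label` is hypothetical and this is not the winning entry's code.

```python
EMPTY = -1  # marker meaning "nothing for that entry"


def leaked_label(row):
    """Infer the pneumonia label from the row layout alone.

    Returns True when every entry is -1 (the row held only pneumonia),
    False when no entry is -1 (no slot was freed by a removal; this can
    be wrong in the less common case of five original diagnoses), and
    None when the row is partly filled and the layout gives no signal.
    """
    if all(d == EMPTY for d in row):
        return True
    if all(d != EMPTY for d in row):
        return False
    return None
```

Rows that return `None` still need a real model; the leak only settles the two extreme cases.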
Leakage Happens
Winning a competition on leakage is easier than building
good models. But even if you don't explicitly understand and
game the leakage, your model will do it for you. Either way,
leakage is a huge problem with data mining contests in
general.
 