Claudia prefers the second kind, because it's closer to what you do in
real life.
How to Be a Good Modeler
Claudia claims that data and domain understanding are the single
most important skills you need as a data scientist. At the same time,
this can't really be taught; it can only be cultivated.
A few lessons learned about data mining competitions that Claudia
thinks are overlooked in academia:
Leakage
The contestants' best friend and the organizers' and practitioners'
worst nightmare. There's always something wrong with the data,
and Claudia has made an art form of figuring out how the people
preparing the competition got lazy or sloppy with the data.
Real-life performance measures
Adapting learning beyond standard modeling evaluation measures
like mean squared error (MSE), misclassification rate, or area
under the curve (AUC). Profit, for example, is a real-life
performance measure.
Feature construction/transformation
Real data is rarely flat (i.e., given to you in a beautiful matrix) and
good, practical solutions for this problem remain a challenge.
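To make the point about real-life performance measures concrete, here is a minimal sketch in Python that scores two sets of predictions by both misclassification rate and profit. The marketing-campaign setting, the function names, and the revenue and cost figures are all assumptions made for illustration; none of them come from Claudia's talk.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def campaign_profit(y_true, y_pred,
                    revenue_per_response=50.0,  # assumed value
                    cost_per_contact=5.0):      # assumed value
    """Profit of contacting everyone predicted to respond (label 1)."""
    profit = 0.0
    for t, p in zip(y_true, y_pred):
        if p == 1:
            profit -= cost_per_contact          # every contact costs money
            if t == 1:
                profit += revenue_per_response  # only true responders pay off
    return profit


# Toy data: 10 customers, 2 of whom would actually respond.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
model_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # contacts no one
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # contacts five, reaches both responders

for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, misclassification_rate(y_true, pred), campaign_profit(y_true, pred))
```

On this toy data, model A has the lower misclassification rate but earns nothing, while model B makes more mistakes and still comes out ahead on profit. Which measure matters depends on what the model is actually used for.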
Data Leakage
In a KDD 2011 paper that Claudia coauthored, “Leakage in Data
Mining: Formulation, Detection, and Avoidance,” she, Shachar Kaufman,
and Saharon Rosset point to another author, Dorian Pyle, who
has written numerous articles and papers on data preparation in data
mining. Pyle refers to a phenomenon that he calls anachronisms
(something that is out of place in time) and says that “too good to be
true” performance is “a dead giveaway” of its existence. Claudia and
her coauthors call this phenomenon “data leakage” in the context of
predictive modeling. Pyle suggests turning to exploratory data analysis
in order to find and eliminate leakage sources. Claudia and her
coauthors sought a rigorous methodology to deal with leakage.
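One exploratory check in the spirit of Pyle's “too good to be true” warning is to score each feature on its own and flag any that predicts the target almost perfectly. The Python sketch below is only an illustration of that idea and is not the methodology from the KDD paper. The feature names, the toy data, and the 0.95 threshold are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def flag_suspicious_features(features, labels, threshold=0.95):
    """Return features whose single-feature AUC is suspiciously strong."""
    flagged = {}
    for name, values in features.items():
        a = auc(values, labels)
        strength = max(a, 1.0 - a)  # direction of the relationship is irrelevant
        if strength >= threshold:
            flagged[name] = strength
    return flagged


# Toy data: "followup_visits" is a hypothetical field recorded after the
# outcome was known, so it could not be available at prediction time.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
features = {
    "age": [34, 51, 45, 29, 62, 38, 47, 55],
    "followup_visits": [3, 0, 2, 0, 0, 4, 0, 0],
}

print(flag_suspicious_features(features, labels))
```

A flagged feature is not proof of leakage. It is a prompt to ask when and how that field was recorded relative to the outcome being predicted.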
Leakage refers to information or data that helps you predict something,
and the fact that you are using this information to predict isn't