prior, or just there for smoothing? Should we choose it on principled
grounds or in order to maximize fit?) There is no institutionalized
knowledge because there are no institutions, and that's why the structure
of interactions matters: you can create your own institutions.
You choose who you are influenced by, as Gabriel Tarde put it (via
Bruno Latour, via Mark Hansen):
When a young farmer, facing the sunset, does not know if he should
believe his school master asserting that the fall of the day is due to the
movement of the earth and not of the sun, or if he should accept as
witness his senses that tell him the opposite, in this case, there is one
imitative ray, which, through his school master, ties him to Galileo.
— Gabriel Tarde
Standing on the shoulders of giants is all well and good, but before
jumping on someone's back you might want to make sure that they
can take the weight. The business world tends to focus on using data
science to sell advertisements. You may have access to the best dataset
in the world, but if the people employing you only want you to find
out how to best sell shoes with it, is it really worth it?
As we worked on assignments and compared solutions, it became clear
that the results of our analyses could vary widely based on just a few
decisions. Even if you've learned all the steps in the process from
hypothesis-building to results, there are so many ways to do each step
that the number of possible combinations is huge. Even then, it's not
as simple as piping the output of one command into the next. Algorithms
are editorial, and the decision of which algorithm and variables
to use is even more so.
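
To give a rough sense of how quickly those combinations multiply, here is a minimal sketch. The step names and the options listed for each step are hypothetical, chosen only for illustration; a real analysis would have its own steps and more options at each.

```python
from itertools import product

# Hypothetical choices at each step of an analysis. The step names and
# options are illustrative assumptions, not taken from the text.
pipeline_choices = {
    "missing_values": ["drop rows", "impute mean", "impute median"],
    "feature_set": ["all variables", "hand-picked", "top 10 by correlation"],
    "scaling": ["none", "standardize", "log transform"],
    "algorithm": ["linear regression", "k-NN", "naive Bayes", "decision tree"],
    "evaluation": ["accuracy", "AUC", "precision/recall"],
}


def count_pipelines(choices):
    """Multiply the number of options available at each step."""
    total = 1
    for options in choices.values():
        total *= len(options)
    return total


def enumerate_pipelines(choices):
    """Yield every combination as a dict mapping step -> chosen option."""
    steps = list(choices)
    for combo in product(*(choices[step] for step in steps)):
        yield dict(zip(steps, combo))


if __name__ == "__main__":
    print(count_pipelines(pipeline_choices), "distinct pipelines")
    # Show the first combination only; listing all of them is rarely useful.
    print(next(enumerate_pipelines(pipeline_choices)))
```

Even with only three or four options at each of five steps, two analysts working from the same data can easily end up with different pipelines, which is one way results come to differ.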
Claudia Perlich from Media 6 Degrees (M6D) won the prestigious KDD Cup
in 2003, 2007, 2008, and 2009, and can now be seen
on the coordinating side of these competitions. She was generous
enough to share with us the ins and outs of the data science circuit and
the different approaches that you can take when making these editorial
decisions. In one competition to predict hospital treatment outcomes,
she noticed that patient identifiers had been assigned sequentially,
such that all the patients from a particular clinic had consecutive numbers.
Because different clinics treated patients with different severities
of condition, the patient ID turned out to be a great predictor for the
outcome in question. Obviously, the inclusion of this data leakage was
unintentional. It made the competition trivial. But in the real world,
perhaps it should actually be used in models; after all, the clinic that