doctors and patients choose should probably be used to predict their
outcomes.
David Madigan emphasized the ethical challenges of editorial decisions
in this emerging domain by showing us how observational studies in the
pharmaceutical industry often yield vastly different results.
(Another example is the aspirin plot he showed.) He emphasized the
importance of not removing oneself from the real data. It is not enough
to merely tweak models and methods and apply them to datasets.
The academic world has a bit of the same problem as the business
world, but for different reasons. The different bits of data science are
so split between disciplines that by studying them individually it becomes
nearly impossible to get a holistic view of how these chunks fit
together, or even that they could fit together. A purely academic approach
to data science can sterilize and quantize it to the point where
you end up with the following, which is an actual homework problem
from the chapter “Linear Methods for Regression” in The Elements of
Statistical Learning:
Ex. 3.2 Given data on two variables $X$ and $Y$, consider fitting a cubic
polynomial regression model $f(X) = \sum_{j=0}^{3} \beta_j X^j$. In addition to
plotting the fitted curve, you would like a 95% confidence band about the
curve. Consider the following two approaches:
1. At each point $x_0$, form a 95% confidence interval for the linear
function $\alpha^T \beta = \sum_{j=0}^{3} \beta_j x_0^j$.
2. Form a 95% confidence set for $\beta$, which in turn generates
confidence intervals for $f(x_0)$.
How do these approaches differ? Which band is likely to be wider?
Conduct a small simulation experiment to compare the two methods.
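For concreteness, here is a minimal sketch of what such a simulation experiment might look like. Everything specific in it is an assumption for illustration and not part of the exercise: the true coefficients `beta_true`, Gaussian noise with unit variance, a sample size of 50, inputs drawn uniformly from [-2, 2], and the choice to take the confidence set in approach 2 to be the usual chi-square-based ellipsoid around the least squares estimate. It relies on NumPy and SciPy.

```python
import numpy as np
from scipy import stats


def cubic_design(x):
    """Design matrix with columns 1, x, x^2, x^3."""
    return np.vander(x, N=4, increasing=True)


def band_half_widths(x, y, x_grid, level=0.95):
    """Fit the cubic by least squares; return the fit and both half-widths on x_grid."""
    X = cubic_design(x)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - p))

    # Standard error of the fitted value at each grid point.
    A = cubic_design(x_grid)
    se = sigma_hat * np.sqrt(np.einsum("ij,jk,ik->i", A, xtx_inv, A))

    # Approach 1: pointwise t interval at each x0.
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - p)
    # Approach 2 (assumed form): project a chi-square ellipsoid for beta onto f(x0).
    set_crit = np.sqrt(stats.chi2.ppf(level, df=p))
    return A @ beta_hat, t_crit * se, set_crit * se


def simulate(n=50, n_reps=2000, seed=0):
    """Compare width and coverage of the two bands over repeated samples."""
    rng = np.random.default_rng(seed)
    beta_true = np.array([1.0, -0.5, 0.3, 0.2])  # hypothetical true coefficients
    x_grid = np.linspace(-2.0, 2.0, 41)
    f_true = cubic_design(x_grid) @ beta_true

    pointwise_hits = 0.0
    whole_curve_hits = np.zeros(2)
    width_ratio = 0.0
    for _ in range(n_reps):
        x = rng.uniform(-2.0, 2.0, size=n)
        y = cubic_design(x) @ beta_true + rng.normal(size=n)  # unit-variance noise
        fit, hw1, hw2 = band_half_widths(x, y, x_grid)
        err = np.abs(fit - f_true)
        pointwise_hits += np.mean(err <= hw1)
        whole_curve_hits += [np.all(err <= hw1), np.all(err <= hw2)]
        width_ratio = float(np.mean(hw2 / hw1))
    return {
        "width_ratio": width_ratio,
        "pointwise_coverage_1": pointwise_hits / n_reps,
        "curve_coverage_1": whole_curve_hits[0] / n_reps,
        "curve_coverage_2": whole_curve_hits[1] / n_reps,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions both bands are multiples of the same standard error, so the band from approach 2 is wider by a constant factor, roughly 1.5 at this sample size. The coverage figures show what that extra width buys: approach 1 targets 95% coverage at each individual point, while approach 2 is meant to contain the whole curve at once.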
This is the kind of problem you might be assigned in a more general
machine learning or data mining class. As fledgling data scientists, our
first reaction is now skeptical. At which point in the process of doing
data science would a problem like this even present itself? How much
would we have to have already done to get to this point? Why are we
considering these two variables in particular? How come we're given
data? Who gave it to us? Who's paying them? Why are we calculating
95% confidence intervals? Would another performance metric be
 