The Students Speak - Doing Data Science


doctors and patients choose should probably be used to predict their outcomes.

David Madigan emphasized the ethical challenges of editorial decisions in this emerging domain by showing us how observational studies in the pharmaceutical industry often yield vastly different results. (Another example is the aspirin plot he showed.) He emphasized the importance of not removing oneself from the real data. It is not enough to merely tweak models and methods and apply them to datasets.

The academic world has a bit of the same problem as the business world, but for different reasons. The different bits of data science are so split between disciplines that by studying them individually it becomes nearly impossible to get a holistic view of how these chunks fit together, or even that they could fit together. A purely academic approach to data science can sterilize and quantize it to the point where you end up with the following, which is an actual homework problem from the chapter “Linear Methods for Regression” in The Elements of Statistical Learning:

Ex. 3.2 Given data on two variables $X$ and $Y$, consider fitting a cubic polynomial regression model $f(X) = \sum_{j=0}^{3} \beta_j X^j$. In addition to plotting the fitted curve, you would like a 95% confidence band about the curve. Consider the following two approaches:

1. At each point $x_0$, form a 95% confidence interval for the linear function $\alpha^T \beta = \sum_{j=0}^{3} \beta_j x_0^j$.

2. Form a 95% confidence set for $\beta$, which in turn generates confidence intervals for $f(x_0)$.

How do these approaches differ? Which band is likely to be wider? Conduct a small simulation experiment to compare the two methods.
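For readers who want to see what such a simulation might look like, here is a minimal sketch in plain Python. It is not the textbook's solution, and everything specific in it is an assumption made for illustration: the "true" coefficients, the evenly spaced x values on [-1, 1], the sample size, and above all a known noise standard deviation `sigma`, which lets us use a normal quantile for approach 1 and a chi-square quantile (4 degrees of freedom, taken from a standard table) for approach 2. With an estimated variance one would reach for t and F quantiles instead. Here the vector in the linear function is taken to be (1, x_0, x_0^2, x_0^3), and the band from approach 2 is the largest deviation of that linear function over the confidence ellipsoid for β.

```python
import math
import random
from statistics import NormalDist

Z_975 = NormalDist().inv_cdf(0.975)  # two-sided 95% normal quantile, about 1.96
CHI2_4_95 = 9.4877                   # 0.95 quantile of chi-square, 4 df (table value)


def design_row(x):
    """Cubic basis (1, x, x^2, x^3) evaluated at x."""
    return [1.0, x, x * x, x ** 3]


def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit(xs, ys):
    """Least-squares cubic fit; returns (beta_hat, inverse of X^T X)."""
    rows = [design_row(x) for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(4)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(4)) for i in range(4)]
    return beta, inv


def predict(beta, x):
    """Evaluate the cubic with coefficients beta at x."""
    return sum(b * v for b, v in zip(beta, design_row(x)))


def half_widths(inv, x0, sigma):
    """Band half-widths at x0: (approach 1 pointwise, approach 2 from the set)."""
    a = design_row(x0)
    quad = sum(a[i] * inv[i][j] * a[j] for i in range(4) for j in range(4))
    se = sigma * math.sqrt(quad)  # standard error of the fitted value at x0
    return Z_975 * se, math.sqrt(CHI2_4_95) * se


def simulate(n_sims=500, n=50, sigma=1.0, seed=0):
    """Fraction of simulated datasets whose band contains the whole true curve."""
    rng = random.Random(seed)
    true_beta = [1.0, -2.0, 0.5, 3.0]  # hypothetical "true" coefficients
    xs = [-1 + 2 * i / (n - 1) for i in range(n)]
    grid = [-1 + 2 * i / 20 for i in range(21)]  # points where bands are checked
    hits = {"pointwise": 0, "set": 0}
    for _ in range(n_sims):
        ys = [predict(true_beta, x) + rng.gauss(0, sigma) for x in xs]
        beta, inv = fit(xs, ys)
        errs = [abs(predict(beta, x) - predict(true_beta, x)) for x in grid]
        widths = [half_widths(inv, x, sigma) for x in grid]
        hits["pointwise"] += all(e <= w[0] for e, w in zip(errs, widths))
        hits["set"] += all(e <= w[1] for e, w in zip(errs, widths))
    return {k: v / n_sims for k, v in hits.items()}


if __name__ == "__main__":
    print(simulate())
```

Under these assumptions the band from approach 2 is wider at every point by the same factor, the ratio of the square root of the chi-square quantile to the normal quantile (roughly 1.57). That is the price of covering the whole curve at once rather than one point at a time, which is what the simultaneous coverage rates returned by `simulate` are meant to show.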

This is the kind of problem you might be assigned in a more general machine learning or data mining class. As fledgling data scientists, our first reaction is now skeptical. At which point in the process of doing data science would a problem like this even present itself? How much would we have to have already done to get to this point? Why are we considering these two variables in particular? How come we're given data? Who gave it to us? Who's paying them? Why are we calculating 95% confidence intervals? Would another performance metric be
