Claudia Perlich - Data Scientists at Work

Database Reference

In-Depth Information

I also like the freedom that I have here in terms of giving talks and being out

there. I tend to get bored doing the same thing over and over, so I need a

variety of things to do at work. There are always different problems with data,

so there is no need to worry about ever being stuck with the same problem

for too long. I never really had the patience and the control for being a formal,

good coder, so I can just hack my way around in Perl or shell or whatever

the hell I want, and that suits my computer science skills the best from what

I can tell.

Gutierrez: What exciting nonwork data puzzles have you played with?

Perlich: A great example comes from a data set that we used to predict breast

cancer. Siemens Medical provided an fMRI [Functional Magnetic Resonance

Imaging] data set that came coded with 117 numeric variables. The puzzle was

to use this data set to predict whether a data sample showed a malignant cancer

or not. It turned out that the most predictive feature in that data set was

the patient ID, the random number that they had assigned to the patient. The

reason for this is that they had to pool different data sets together because

they did not get enough positive data from any one particular source. So they

had to source data sets from different places, and as it turned out, some of the

treatment centers had very high breast cancer prevalence rates. Patients had

random numbers that were assigned by each location. So the model was able

to figure out that a patient being in a specific treatment center was a great

indicator of whether the sample was malignant or not.

The bigger picture of this work is asking the question: “Is this data set suit-

able if we really wanted to build a model that identifies breast cancer?” The

answer is, “It depends.” You cannot ignore it, because even if you do not use

the patient identifier, which, of course, you do not really want to have in your

model, the model still finds a kind of the calibration of the grayscale. So the

model still implicitly learns from the location. If you want to use that model on

a different set of locations, it is obviously not going to work at all. If you want

to use it on the same set of locations, you should just basically put an identifier

for location in there. That is the best model that you can build.

The interesting observation from this is that you really had to change the

data set or augment it if you want to make it useful. That was just one of

these accidents where you are looking at it and you think to yourself, “This is

strange. This seems like a weird story.” That is what was really fun. These are

the hidden stories in the data collection that I want to get to the bottom of

when I work with data. I find that type of thinking and work makes me very

happy. I get really excited by the somewhat abstract intellectual challenge of it.

What amazes me is how much a data set contradicts my expectations. If the

data is just what I expected it to be—it is surprisingly clean maybe, but it does

not have the puzzle about it, then it does not really get me excited.

Search WWH ::

Custom Search

Home