I also like the freedom that I have here in terms of giving talks and being out
there. I tend to get bored doing the same thing over and over, so I need a
variety of things to do at work. There are always different problems with data,
so there is no need to worry about ever being stuck with the same problem
for too long. I never really had the patience and the control for being a formal,
good coder, so I can just hack my way around in Perl or shell or whatever
the hell I want, and that suits my computer science skills the best from what
I can tell.
Gutierrez: What exciting nonwork data puzzles have you played with?
Perlich: A great example comes from a data set that we used to predict breast
cancer. Siemens Medical provided an fMRI [Functional Magnetic Resonance
Imaging] data set that came coded with 117 numeric variables. The puzzle was
to use this data set to predict whether a data sample showed a malignant cancer
or not. It turned out that the most predictive feature in that data set was
the patient ID, the random number that they had assigned to the patient. The
reason for this is that they had to pool different data sets together because
they did not get enough positive data from any one particular source. So they
had to source data sets from different places, and as it turned out, some of the
treatment centers had very high breast cancer prevalence rates. Patients had
random numbers that were assigned by each location. So the model was able
to figure out that a patient being in a specific treatment center was a great
indicator of whether the sample was malignant or not.
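The leak described above can be reproduced on a toy example. The following is a minimal sketch on made-up data, not the Siemens data set: the two sites, their ID ranges, and the 10% and 90% prevalence rates are all assumptions chosen for illustration. It shows how a rule that looks only at the patient ID can beat a baseline once each site assigns IDs from its own range.

```python
# Synthetic illustration only: sites, ID ranges, and prevalence rates are hypothetical.

def make_site(first_id, n_patients, n_malignant_per_10):
    """Build (patient_id, is_malignant) pairs for one hypothetical site.

    IDs are handed out sequentially from first_id, so the ID encodes the site.
    """
    records = []
    for i in range(n_patients):
        patient_id = first_id + i
        is_malignant = (i % 10) < n_malignant_per_10
        records.append((patient_id, is_malignant))
    return records


def accuracy(records, predict):
    """Fraction of records where predict(patient_id) matches the label."""
    correct = sum(predict(pid) == label for pid, label in records)
    return correct / len(records)


# Low-prevalence site (10% malignant) and high-prevalence site (90% malignant).
low_prevalence_site = make_site(first_id=0, n_patients=1000, n_malignant_per_10=1)
high_prevalence_site = make_site(first_id=100_000, n_patients=1000, n_malignant_per_10=9)
pooled = low_prevalence_site + high_prevalence_site

# "Model" that uses nothing but the ID: a large ID means the high-prevalence site.
id_rule_accuracy = accuracy(pooled, lambda pid: pid >= 100_000)

# Baseline that always predicts benign.
baseline_accuracy = accuracy(pooled, lambda pid: False)

print(f"ID-only rule: {id_rule_accuracy:.2f}")   # 0.90
print(f"Baseline:     {baseline_accuracy:.2f}")  # 0.50
```

The ID carries no medical information here. It only reveals which site the sample came from, and the site in turn determines how likely the sample is to be malignant.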
The bigger picture of this work is asking the question: “Is this data set
suitable if we really want to build a model that identifies breast cancer?” The
answer is, “It depends.” You cannot ignore where the data came from, because
even if you do not use the patient identifier, which, of course, you do not really
want to have in your model, the model still picks up on something like the
calibration of the grayscale, which differs from location to location. So the
model still implicitly learns from the location. If you want to use that model on
a different set of locations, it is obviously not going to work at all. If you want
to use it on the same set of locations, you should just basically put an identifier
for location in there. That is the best model that you can build.
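A rough way to see why removing the identifier is not enough is to let a feature carry the site implicitly. The sketch below again uses made-up data: the scanner offsets, the prevalence rates, and the hand-picked threshold are assumptions, and the single intensity feature stands in for whatever grayscale calibration a real model would pick up. The intensity never depends on the label, only on the site.

```python
# Synthetic illustration only: scanner offsets, prevalence rates, and the
# threshold are hypothetical stand-ins for site-specific grayscale calibration.

def make_site(offset, n_patients, n_malignant_per_10):
    """Build (intensity, is_malignant) pairs for one hypothetical site.

    The intensity depends only on the site's scanner offset, never on the label.
    """
    records = []
    for i in range(n_patients):
        intensity = offset + (i % 7)  # small label-independent variation
        is_malignant = (i % 10) < n_malignant_per_10
        records.append((intensity, is_malignant))
    return records


def accuracy(records, threshold):
    """Accuracy of the rule 'predict malignant when intensity >= threshold'."""
    correct = sum((intensity >= threshold) == label for intensity, label in records)
    return correct / len(records)


# Training pool: a dark scanner at a 10% site and a bright scanner at a 90% site.
training_sites = make_site(offset=0, n_patients=1000, n_malignant_per_10=1) \
    + make_site(offset=50, n_patients=1000, n_malignant_per_10=9)

# New location: bright scanner as well, but only 10% malignant.
new_site = make_site(offset=50, n_patients=1000, n_malignant_per_10=1)

# Threshold chosen by hand to separate the two scanners' intensity ranges.
threshold = 25

seen_accuracy = accuracy(training_sites, threshold)
unseen_accuracy = accuracy(new_site, threshold)

print(f"Same locations: {seen_accuracy:.2f}")    # 0.90
print(f"New location:   {unseen_accuracy:.2f}")  # 0.10
```

On the pooled locations the rule looks strong, but all it has learned is which scanner produced the image. Holding out an entire location during evaluation, rather than a random sample of patients, is one way to expose this.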
The interesting observation from this is that you really have to change the
data set or augment it if you want to make it useful. That was just one of
these accidents where you are looking at it and you think to yourself, “This is
strange. This seems like a weird story.” That is what was really fun. These are
the hidden stories in the data collection that I want to get to the bottom of
when I work with data. I find that type of thinking and work makes me very
happy. I get really excited by the somewhat abstract intellectual challenge of it.
What amazes me is how much a data set can contradict my expectations. If the
data is just what I expected it to be (surprisingly clean, maybe, but without
a puzzle about it), then it does not really get me excited.
 