Gutierrez: What are your thoughts about systematic biases, overfitting models,
and “too-clever” versus “works-way-better”?
Tunkelang: I'm a big fan of Occam's razor, which in modeling translates into
a preference for minimum description length. I prefer small, understandable
models, where the coefficients use as few bits as possible. If a model is going
to be more complex, it has to prove itself to be more accurate in online testing.
And an increase in accuracy doesn't always justify an increase in complexity,
as more complex models are harder to debug when they break.
As for systematic bias and overfitting, they're an ongoing concern. We
try to anticipate it by carefully reviewing our methods for collecting training
data and considering every way we may have introduced bias. Worst case,
we discover bias in our training data when models that perform well on our
training data fail against withheld data. Then we use these failures to improve
our collection process.
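
The check described above, where a model that does well on training data fails against withheld data, can be sketched in a few lines. This is a minimal illustration and not the team's actual tooling: the toy data, the two deliberately mislabeled training points, and the names `accuracy`, `fit_memorizer`, `threshold_rule`, and `generalization_gap` are all assumptions made for the example.

```python
# Toy data (hypothetical): the true label is 1 when x >= 5.
# Two training labels (x=2 and x=7) are deliberately wrong, standing in
# for bias or noise introduced while collecting training data.
TRAIN = [(0, 0), (1, 0), (2, 1), (3, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
HELD_OUT = [(2, 0), (7, 1), (4, 0), (5, 1), (1, 0), (8, 1)]


def accuracy(predict, examples):
    """Fraction of (x, y) examples for which predict(x) equals y."""
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)


def fit_memorizer(train):
    """Complex model: return the label of the nearest training point."""
    def predict(x):
        _, nearest_label = min(train, key=lambda ex: abs(ex[0] - x))
        return nearest_label
    return predict


def threshold_rule(x, cut=5):
    """Simple model: a single cut point, set by hand for this sketch."""
    return int(x >= cut)


def generalization_gap(predict, train, held_out):
    """Training accuracy minus held-out accuracy; a large positive gap is a warning sign."""
    return accuracy(predict, train) - accuracy(predict, held_out)


memorizer = fit_memorizer(TRAIN)
for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    print(name,
          "train:", round(accuracy(model, TRAIN), 3),
          "held-out:", round(accuracy(model, HELD_OUT), 3),
          "gap:", round(generalization_gap(model, TRAIN, HELD_OUT), 3))
```

In this sketch the memorizer reproduces every training label, including the two bad ones, so its training score looks perfect while its held-out score drops. The examples it gets wrong on held-out data point straight back at the mislabeled training points, which is the kind of failure that can then feed into fixing the collection process.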
Gutierrez: How would you describe your work to someone who is not
familiar with it but familiar with data science?
Tunkelang: As data scientists, our job is to extract signal from noise. We do
this in many contexts, from performing analyses that drive business strategy
to enabling data products like recommender systems. In the context of search
quality, that means analyzing the content we index and the way searchers interact
with it to deliver relevant results and improve the search experience.
Gutierrez: What in your career are you most proud of so far?
Tunkelang: What my colleagues and I accomplished at Endeca was something
extraordinary. We helped change how people think about search, and I
see the effects every day that I browse the web.
Gutierrez: When did you realize you wanted to work with data as a career?
Tunkelang: I'm not sure there was any particular moment of realization.
I always loved math and computer science. Early on, I was more tempted
by theory than practice, obsessed with open problems in combinatorics and
computational complexity. But ultimately I couldn't resist working on problems
with practical consequences, and that's how I found myself specializing in
information retrieval and data science more broadly.
Gutierrez: How did you get interested in working with data?
Tunkelang: One of the problems I worked on at IBM was visualizing semantic
networks obtained by applying natural language processing algorithms to large
document collections. Even though my focus was on the network visualization
algorithms, I couldn't help noticing that the natural language processing
algorithms had their good moments and bad moments. And that there was
only so much I could do with visualization algorithms if the raw data was noisy.
 