Daniel Tunkelang - Data Scientists at Work

Gutierrez: What are your thoughts about systematic biases, overfitting models, and “too-clever” versus “works-way-better”?

Tunkelang: I'm a big fan of Occam's razor, which in modeling translates into a preference for minimum description length. I prefer small, understandable models, where the coefficients use as few bits as possible. If a model is going to be more complex, it has to prove itself to be more accurate in online testing. And an increase in accuracy doesn't always justify an increase in complexity, as more complex models are harder to debug when they break.

As for systematic bias and overfitting, it's always an ongoing concern. We try to anticipate it by carefully reviewing our methods for collecting training data and considering every way we may have introduced bias. Worst case, we discover bias in our training data when models that perform well on our training data fail against withheld data. Then we use these failures to improve our collection process.
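
The withheld-data check described in this answer can be illustrated with a minimal Python sketch. It is not from the interview: the synthetic data, the two toy models, and every name in it (`make_data`, `fit_line`, `fit_memorizer`, `mse`, `compare`) are assumptions chosen for illustration. A two-coefficient line is compared against a model that memorizes its training set, and each is scored on its training data and on withheld data.

```python
import random


def make_data(n, rng, noise=1.0):
    """Noisy samples of a hypothetical true relation y = 2x + 1."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, noise)) for x in xs]


def fit_line(data):
    """Small model: ordinary least squares for y = a*x + b."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_memorizer(data):
    """Complex model: 1-nearest-neighbor lookup that stores every point."""
    def predict(x):
        return min(data, key=lambda p: abs(p[0] - x))[1]
    return predict


def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def compare(seed=0, n=300):
    """Return {model name: (training error, withheld error)}."""
    rng = random.Random(seed)
    train = make_data(n, rng)
    withheld = make_data(n, rng)  # never shown to either model while fitting
    results = {}
    for name, fit in (("line", fit_line), ("memorizer", fit_memorizer)):
        model = fit(train)
        results[name] = (mse(model, train), mse(model, withheld))
    return results


if __name__ == "__main__":
    for name, (train_err, withheld_err) in compare().items():
        print(f"{name}: train={train_err:.3f} withheld={withheld_err:.3f}")
```

Calling `compare()` returns a (training error, withheld error) pair per model. The memorizer reproduces its training data exactly and does noticeably worse on withheld data, which is the failure pattern the answer describes, while the line's two errors stay close together.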

Gutierrez: How would you describe your work to someone who is not familiar with it but familiar with data science?

Tunkelang: As data scientists, our job is to extract signal from noise. We do this in many contexts, from performing analyses that drive business strategy to enabling data products like recommender systems. In the context of search quality, that means analyzing the content we index and the way searchers interact with it to deliver relevant results and improve the search experience.

Gutierrez: What in your career are you most proud of so far?

Tunkelang: What my colleagues and I accomplished at Endeca was something extraordinary. We helped change how people think about search, and I see the effects every day that I browse the web.

Gutierrez: When did you realize you wanted to work with data as a career?

Tunkelang: I'm not sure there was any particular moment of realization. I always loved math and computer science. Early on, I was more tempted by theory than practice, obsessed with open problems in combinatorics and computational complexity. But ultimately I couldn't resist working on problems with practical consequences, and that's how I found myself specializing in information retrieval and data science more broadly.

Gutierrez: How did you get interested in working with data?

Tunkelang: One of the problems I worked on at IBM was visualizing semantic networks obtained by applying natural language processing algorithms to large document collections. Even though my focus was on the network visualization algorithms, I couldn't help noticing that the natural language processing algorithms had their good moments and bad moments. And that there was only so much I could do with visualization algorithms if the raw data was noisy.
