Gutierrez: What specific tools and techniques do you use?
Tunkelang: We use the usual tricks of the data science trade—machine learning
models, A/B testing, crowdsourced evaluation, data collection, and similar
techniques. Most importantly, we look at data and logs below the aggregate
level. It's easy to be lazy and look at aggregates—for example, favoring one
machine-learned model over another because it performs better on average.
Drilling down into the differences and looking at specific examples is often
what gives us a real understanding of what's going on.
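A minimal sketch of that kind of drill-down, in Python with simulated data: two models whose average quality looks nearly identical until the scores are broken out by user segment. The segment names, score distributions, and sample sizes are all invented for illustration.

    import numpy as np

    # Simulated per-query quality scores for two models across three
    # hypothetical user segments. The point: near-identical overall
    # averages can hide large per-segment differences.
    rng = np.random.default_rng(42)
    segments = ["new_users", "power_users", "international"]

    scores = {
        "model_a": {s: rng.normal(0.70, 0.05, 1000) for s in segments},
        "model_b": {
            "new_users": rng.normal(0.78, 0.05, 1000),
            "power_users": rng.normal(0.62, 0.05, 1000),
            "international": rng.normal(0.70, 0.05, 1000),
        },
    }

    for model, by_segment in scores.items():
        overall = np.concatenate(list(by_segment.values())).mean()
        print(f"{model}: overall mean = {overall:.3f}")
        for seg, vals in by_segment.items():
            print(f"  {seg:>13}: {vals.mean():.3f}")

On aggregates alone the two models are a wash; the per-segment breakdown shows model_b helping new users while degrading power users, which is exactly the signal an average hides.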
Gutierrez: What nascent tool are you most excited about?
Tunkelang: I'm not sure it still qualifies as nascent, but I'm very excited
about human computation. I can't imagine data science today without crowd-
sourcing for data collection and evaluation. For example, I'm studying Italian
using Duolingo, a free language-learning app that doubles as a crowdsourced
text translation platform. These are early days for human computation, and I
expect we'll see even more powerful applications over the next few years.
Gutierrez: You mentioned drilling down to get a real understanding of a
model. How do you measure real understanding?
Tunkelang: I don't know of a quantitative metric for understanding. But the
consequences of understanding are easy to quantify. When we realize that
a model improves performance for one user segment, but degrades it for
others, we have a starting point to investigate why. And hopefully we end up
with a richer model—or perhaps two distinct models—that allow us to per-
form better for both segments. Ideally, we learn even more: a deeper under-
standing of what distinguishes our segments, and insights that carry over to
the rest of our user base beyond those segments.
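One way the "two distinct models" idea can be realized is a simple per-segment dispatch, sketched below in Python. The segment names, feature names, and scoring functions are hypothetical placeholders, not anything from the interview.

    from typing import Callable, Dict

    Features = Dict[str, float]

    def generalist_model(features: Features) -> float:
        # Placeholder scoring logic standing in for a learned model.
        return 0.5 * features.get("query_length", 0.0)

    def power_user_model(features: Features) -> float:
        # Specialist for the hypothetical segment the generalist hurt.
        return (0.3 * features.get("query_length", 0.0)
                + 0.7 * features.get("history_clicks", 0.0))

    # Route each segment to its best-performing model, falling back
    # to the generalist for segments without a specialist.
    SEGMENT_MODELS: Dict[str, Callable[[Features], float]] = {
        "power_users": power_user_model,
    }

    def score(segment: str, features: Features) -> float:
        model = SEGMENT_MODELS.get(segment, generalist_model)
        return model(features)

    print(score("power_users", {"query_length": 4.0, "history_clicks": 12.0}))
    print(score("new_users", {"query_length": 4.0}))

The alternative Tunkelang mentions, a single richer model, would instead fold the segment itself in as a feature; the dispatch version is simply the easiest to illustrate.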
Gutierrez: How do you communicate your results to other groups in the
company?
Tunkelang: How we present and communicate our work to the rest of the company
varies. We give presentations to our peers who work on similar relevance and data
science problems. But sometimes we work more closely with particular teams
because our work is highly related. For example, there are relationships
between the team fighting abusive search engine optimization and the fraud
team. One thing we've learned is
that there's no such thing as over-communicating. No one ever complains that they
have too much access to information about what their peers are doing.
Gutierrez: You've mentioned logs in previous answers. Do you have a system
to help you look at these logs?
Tunkelang: We have a variety of in-house reporting tools that we use for
regular log analysis. And when those aren't flexible enough, we use tools
like Hive or Pig to perform ad hoc analysis. Of course, a crucial part of this
process is that we instrument and track everything. And we've built a variety
of in-house tools to support that instrumentation and tracking.
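As a rough illustration of that kind of ad hoc analysis, the Python sketch below slices a hypothetical search-event log by user segment and computes a click-through rate. The file name, column names, and metric are assumptions for this sketch; at production scale this would be a Hive or Pig job over distributed logs rather than a single-process script.

    import csv
    from collections import defaultdict

    # Count events per (segment, event type) from a raw log export.
    counts = defaultdict(lambda: defaultdict(int))

    with open("search_events.csv", newline="") as f:  # hypothetical export
        for row in csv.DictReader(f):
            counts[row["user_segment"]][row["event_type"]] += 1

    # Report click-through rate per segment.
    for segment, c in sorted(counts.items()):
        impressions, clicks = c["impression"], c["click"]
        if impressions:
            print(f"{segment}: CTR = {clicks / impressions:.2%} "
                  f"({clicks}/{impressions})")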