The results were useful in two ways. First, we could provide a link to the home
page on the local business page. Second, we could improve the association
as a signal to web search relevance to better determine when the intent of
the searcher was to find a local business. When web search determines this
intent, it typically shows a map and other information relevant to this class
of search queries. This was a fun machine learning problem, and our accuracy
not only improved the quality of the local search pages but also helped
Google figure out when web searchers were looking for a local business so
that it could respond with maps and other appropriate content.
When I arrived at Google, there was already a system in place to map businesses
to home pages. It was a machine learning system; specifically, it used
logistic regression to assign scores to candidate home pages for businesses.
I can't disclose numbers, but there was lots of room to improve its precision
and coverage. Moreover, the model was unstable and difficult to interpret,
which made it hard to build on with incremental improvements. So
we decided to explore other approaches that would not only improve our
system's accuracy, but also facilitate ongoing work to improve it.
I can't say too much about our results—the numbers are confidential under
my NDA. But what I can say is that we significantly improved accuracy through
a series of changes that included switching from a logistic regression model to
a decision tree approach. That was surprising, since decision trees are hardly
cutting-edge machine learning models. However, they are very interpretable and
that interpretability made it much easier for us to gain insight and iterate.
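The interpretability contrast Tunkelang describes can be sketched with a toy example: a shallow decision tree's learned rules can be printed as plain if/else conditions that a human can read and debug, which is much harder with the weighted sum inside a logistic regression. The features and data below are hypothetical stand-ins (using scikit-learn), not anything from the actual Google system:

```python
# Hypothetical sketch: train a small decision tree on synthetic
# "candidate home page" features and print its learned rules.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-ins for features such as "business name appears
# in the page title" -- these names are illustrative only.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
feature_names = ["name_in_title", "address_match", "phone_match", "domain_len"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text renders the tree as nested threshold rules,
# e.g. "|--- name_in_title <= 0.52" -- directly inspectable.
print(export_text(tree, feature_names=feature_names))
```

Capping `max_depth` keeps the printed rule set short enough to read in full, which is exactly the property that makes iteration on features and training data easier.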
Gutierrez: Do you find that non-cutting-edge models sometimes work better
than newer models as they are applied to new domains?
Tunkelang: I'm not saying that non-cutting-edge models work better—indeed,
I'd like to think that progress in machine learning ensures the opposite! Rather,
it pays to keep things simple when you're trying to understand your data and
iteratively develop models for it. In those cases, it's better to optimize for
interpretability rather than accuracy. Once you've learned as much as you can, you
can go back to more complex models. When you go back to them, you'll
hopefully now have the right training data, objective function, and features to
take advantage of the latest and greatest machine learning has to offer.
Gutierrez: How important is it to continue working on models that have
already been built?
Tunkelang: There's no preference for replacing versus improving models.
We put most of our efforts into collecting better training data and coming
up with new features. Those usually require us to train new models. There's
some bias towards reusing our existing infrastructure, because that's usually
less work and helps us avoid introducing new bugs. But we do our best to
evaluate models on their own merits, even if that means doing more work to
take advantage of a new approach.
 