need a second pair of eyes or a second brain to look at this." Not having them
be subordinates makes this type of conversation better. So we try to keep it
pretty level here.
Gutierrez: What are the responsibilities of the Data Science Group?
Perlich: We have three main responsibilities: models, performance monitoring, and fraud detection, in addition to communicating with people outside of our group. On the model side, we now build on the order of 10,000 predictive models a week, each of which lives in a very high-dimensional space.
These models are based on the URL histories that we prune down to maybe
2 million URLs from the data set of 10 million URLs or more. This process is
completely automated. Even with a team of six people, we are not going to
look at 10,000 models. It is not happening.
Sometimes the modeling work means building very specific models and prototypes as well. For instance, one thing we did recently was to build a bidding model that evaluates not just the history of what a person has done before, but specifically estimates the correct bid price for that person in a real-time advertising auction based on what the person is doing right now, or reconditions the bid based on how likely we think that person is a runner. So we build a prototype, we run it on a small scale in production to see if it works, and then we supervise the automation. Then it is built by our engineering team as a full-strength, fully automated process that contains a quality assurance part that sends warnings if things go wrong.
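The bid-pricing idea described above can be sketched as a function that blends a long-term history score with a signal about what the person is doing right now. Everything here is an illustrative assumption (the function name, weights, and CPM bounds are invented for this sketch), not the actual production model:

```python
# Hedged sketch of the described bidding model: combine a score from the
# user's pruned URL history with a score from their current activity to
# set a bid price. Weights and bounds are illustrative assumptions.

def bid_price(history_score: float,
              current_activity_score: float,
              base_cpm: float = 1.0,
              max_cpm: float = 10.0) -> float:
    """Estimate a CPM bid for one real-time auction request.

    history_score: model score from the user's URL history, in [0, 1].
    current_activity_score: score from the current session, in [0, 1].
    """
    # Weight what the person is doing right now more heavily than
    # their long-term history, then scale between a floor and a cap.
    combined = 0.4 * history_score + 0.6 * current_activity_score
    return min(max_cpm, base_cpm + combined * (max_cpm - base_cpm))

# A user with a moderate history but strong current intent gets a
# substantially higher bid than the floor.
price = bid_price(history_score=0.5, current_activity_score=0.8)
```

In a real system this function would have to return within the auction's response deadline, which is why such scoring is kept to cheap arithmetic over precomputed model scores.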
On the monitoring side, we supervise how our models are performing. Some of this is watching the performance, and other parts are dealing with the QA process if and when it sends out warnings that things are going wrong. A final part is actually investigating when something is wrong.
On the fraud detection side, this is always going on. We have to deal with a
great deal of advertising fraud. We receive about 30 billion bid requests a day.
We have about 30 milliseconds to decide whether or not we want to bid on
a specific request when it comes in. If we bid and win, then our system shows
an ad in that specific real-time auction. The problem is that a good chunk of
those bid requests are bots, artificial traffic, or nonintentional web page visits
that are unlikely to actually ever be seen by anybody. This causes the fraud
problem on our side to actually be two problems: one—deciding whether or
not the traffic is fraudulent, and so whether or not to show ads, and two—
understanding how the traffic that is deemed fraudulent affects our models.
Because data on the ads we have shown is part of our models, fraudulent traffic is fed into the models, which means we have to think very hard about how to counteract the way fraud data affects our models. Interestingly, models are much better at finding out who is a bot and who is not, because bots display deterministic behavior. This stands out because our models are predicting the
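The bot-detection idea mentioned here can be sketched as a check for overly regular behavior: a scripted bot's visit timing barely varies, while a human's does. The function name, threshold, and sample data below are illustrative assumptions, not the actual fraud system:

```python
# Hedged sketch of detecting bots through their deterministic behavior:
# flag traffic whose inter-visit timing is suspiciously regular.
from statistics import pstdev

def looks_like_bot(visit_timestamps: list[float],
                   min_visits: int = 5,
                   regularity_threshold: float = 0.5) -> bool:
    """Flag traffic whose visit intervals are nearly constant."""
    if len(visit_timestamps) < min_visits:
        return False  # too little history to judge
    intervals = [b - a for a, b in zip(visit_timestamps,
                                       visit_timestamps[1:])]
    # A human's gaps between visits vary widely; a bot hitting pages on
    # a schedule produces intervals with almost no spread.
    return pstdev(intervals) < regularity_threshold

# A bot requesting a page every 60 seconds vs. a human with uneven gaps.
bot_visits = [0, 60, 120, 180, 240, 300]
human_visits = [0, 35, 400, 460, 2000, 2600]
```

A production system would of course combine many such signals; the point of the sketch is only that deterministic timing stands out against the natural variability the models expect from real people.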
 