need a second pair of eyes or a second brain to look at this." Not having them
be subordinates makes this type of conversation better. So we try to keep it
pretty level here.
Gutierrez: What are the responsibilities of the Data Science Group?
Perlich: We have three main responsibilities: models, performance monitoring, and fraud detection, in addition to communicating with people outside of our group. On the model side, we now build on the order of 10,000 predictive models a week, each of which lives in a very high-dimensional space.
These models are based on the URL histories that we prune down to maybe
2 million URLs from the data set of 10 million URLs or more. This process is
completely automated. Even with a team of six people, we are not going to
look at 10,000 models. It is not happening.
Sometimes the modeling work means building very specific models and prototypes as well. For instance, one thing we did recently was to build a bidding model that evaluates not just the history of what a person has done before, but specifically estimates the correct bid price for that person in a real-time advertising auction based on what the person is doing right now, or reconditions the bid based on how likely we think that person is a runner. So we build a prototype, we run it on a small scale in production to see if it works, and then we supervise the automation. Then it is built by our engineering team as a full-strength, fully automated process that contains a quality assurance part that sends warnings if things go wrong.
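The bid-pricing idea described above can be sketched as a function that blends a long-term history score with a signal about what the person is doing right now. Everything here is an illustrative assumption (the function name, weights, and CPM bounds are invented for this sketch), not the actual production model:

```python
# Hedged sketch of the described bidding model: combine a score from the
# user's pruned URL history with a score from their current activity to
# set a bid price. Weights and bounds are illustrative assumptions.

def bid_price(history_score: float,
              current_activity_score: float,
              base_cpm: float = 1.0,
              max_cpm: float = 10.0) -> float:
    """Estimate a CPM bid for one real-time auction request.

    history_score: model score from the user's URL history, in [0, 1].
    current_activity_score: score from the current session, in [0, 1].
    """
    # Weight what the person is doing right now more heavily than
    # their long-term history, then scale between a floor and a cap.
    combined = 0.4 * history_score + 0.6 * current_activity_score
    return min(max_cpm, base_cpm + combined * (max_cpm - base_cpm))

# A user with a moderate history but strong current intent gets a
# substantially higher bid than the floor.
price = bid_price(history_score=0.5, current_activity_score=0.8)
```

In a real system this function would have to return within the auction's response deadline, which is why such scoring is kept to cheap arithmetic over precomputed model scores.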
On the monitoring side, we supervise how our models are performing. Some of this is watching the performance, and other parts are dealing with the QA process if and when it sends out warnings that things are going wrong. A final part is actually investigating when something is wrong.
On the fraud detection side, this is always going on. We have to deal with a
great deal of advertising fraud. We receive about 30 billion bid requests a day.
We have about 30 milliseconds to decide whether or not we want to bid on
a specific request when it comes in. If we bid and win, then our system shows
an ad in that specific real-time auction. The problem is that a good chunk of
those bid requests are bots, artificial traffic, or nonintentional web page visits
that are unlikely to actually ever be seen by anybody. This causes the fraud
problem on our side to actually be two problems: one—deciding whether or
not the traffic is fraudulent, and so whether or not to show ads, and two—
understanding how the traffic that is deemed fraudulent affects our models.
Because data on the ads we have shown is part of our models, fraudulent traffic is fed into the models, which means we have to think very hard about how to counteract the way fraud data affects our models. Interestingly, models are much better at finding out who is a bot and who is not, because bots display deterministic behavior. This stands out because our models are predicting the
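The bot-detection idea mentioned here can be sketched as a check for overly regular behavior: a scripted bot's visit timing barely varies, while a human's does. The function name, threshold, and sample data below are illustrative assumptions, not the actual fraud system:

```python
# Hedged sketch of detecting bots through their deterministic behavior:
# flag traffic whose inter-visit timing is suspiciously regular.
from statistics import pstdev

def looks_like_bot(visit_timestamps: list[float],
                   min_visits: int = 5,
                   regularity_threshold: float = 0.5) -> bool:
    """Flag traffic whose visit intervals are nearly constant."""
    if len(visit_timestamps) < min_visits:
        return False  # too little history to judge
    intervals = [b - a for a, b in zip(visit_timestamps,
                                       visit_timestamps[1:])]
    # A human's gaps between visits vary widely; a bot hitting pages on
    # a schedule produces intervals with almost no spread.
    return pstdev(intervals) < regularity_threshold

# A bot requesting a page every 60 seconds vs. a human with uneven gaps.
bot_visits = [0, 60, 120, 180, 240, 300]
human_visits = [0, 35, 400, 460, 2000, 2600]
```

A production system would of course combine many such signals; the point of the sketch is only that deterministic timing stands out against the natural variability the models expect from real people.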
 