Gutierrez: What is the hit rate of things that work the first time?
Lenaghan: Around 50 percent. Our teams operate by always trying to build
a prototype first. On the data science side, this initial prototype is usually a
mixture of Java and/or Python and/or R. Again, we always try to keep our eye
on what the final piece is going to be. If we know that performance is going
to be a problem, we may start in Java from the very beginning. If we do build
a prototype, we usually make it as lightweight as possible.
Gutierrez: Why as lightweight as possible?
Lenaghan: I do not like writing a lot of code or doing a lot of work for something I do not know is going to succeed. So we build the prototype and start
working on it with small data sets first. One of the first tests that we do is a
scaling test. Even if the prototype is not super-performant, we want to make
sure that it is capable of processing all of the data. Even if our prototype code
is six times slower than the production code we are eventually going to write,
we do want to be sure that it is capable of processing terabytes of data.
Gutierrez: If the prototype performs well, what happens next?
Lenaghan: If the prototype performs well on the scaling test, then we move
to the production phase. I would say that about 60 percent of the time we
involve engineering, and about 40 percent of the time we do it ourselves. If we
need something really performant and it is complicated and involves a lot of
configuration, then we always involve engineering there. Eventually there is a
process to migrate the prototype to production code. Engineering will push
our combined work to the dev ops group, which is where it is moved into
production. Then we monitor it and hopefully never touch it again.
Gutierrez: How do you do the scaling test?
Lenaghan: We slowly step up the scale of data we run through the prototype in two dimensions. We have the geospatial dimension, which is large,
but not extremely large. There we are talking about hundreds of millions of
entities, let's say, in the United States. We also have the second dimension,
which we think of as the movement side. This is the data coming from the
ad-request side. This data is on the order of tens of billions of data points per
month. We want to understand how well the prototype scales up in the two
dimensions—the spatial dimension and the movement dimension. Usually, we
start on the geospatial side and apply our analysis to just one metro area. For
various reasons, we always use San Francisco. We could use New York City,
but Manhattan is too anomalous.
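
[Editor's note: the following is a minimal illustrative sketch, not code from the interview. It shows one way the two-dimensional scaling test described above could be structured, stepping up the geospatial dimension and the movement (ad-request) dimension independently and timing each run. All helper names, region labels, and data sizes are hypothetical placeholders.]

    # Hypothetical sketch of a two-dimensional scaling test: step up the
    # geospatial dimension and the movement dimension, timing each combination.
    import time

    GEO_STEPS = ["san_francisco_metro", "california", "united_states"]  # spatial dimension
    MOVEMENT_STEPS_DAYS = [1, 7, 30]                                    # movement dimension

    def load_entities(region):
        """Placeholder: load geospatial entities for a region (assumed helper)."""
        return [f"{region}_entity_{i}" for i in range(1000)]

    def load_ad_requests(days):
        """Placeholder: load `days` worth of ad-request movement data (assumed helper)."""
        return ({"day": d, "point": p} for d in range(days) for p in range(10000))

    def run_prototype(entities, requests):
        """Placeholder for the prototype pipeline under test."""
        return sum(1 for _ in requests) + len(entities)

    def scaling_test():
        for region in GEO_STEPS:
            for days in MOVEMENT_STEPS_DAYS:
                start = time.time()
                processed = run_prototype(load_entities(region), load_ad_requests(days))
                print(f"{region:>20} | {days:>3} day(s) | {processed} records | "
                      f"{time.time() - start:.2f}s")

    if __name__ == "__main__":
        scaling_test()
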
Gutierrez: San Francisco is the base metro area for the spatial dimension
testing?
Lenaghan: Exactly. We set the initial geographic scale starting with the metro,
and then on the movement side, we will start with a day's worth of data. Then
we scale the data to a week's worth of data. Then we scale up the data to