Gutierrez: What is the hit rate of things that work the first time?
Lenaghan: Around 50 percent. Our teams operate by always trying to build
a prototype first. On the data science side, this initial prototype is usually a
mixture of Java and/or Python and/or R. Again, we always try to keep our eye
on what the final piece is going to be. If we know that performance is going
to be a problem, we may start in Java from the very beginning. If we do build
a prototype, we usually make it as lightweight as possible.
Gutierrez: Why as lightweight as possible?
Lenaghan: I do not like writing a lot of code or doing a lot of work for something I do not know is going to succeed. So we build the prototype and start
working on it with small data sets first. One of the first tests that we do is a
scaling test. Even if the prototype is not super-performant, we want to make
sure that it is capable of processing all of the data. Even if our prototype code
is six times slower than the production code we are eventually going to write,
we do want to be sure that it is capable of processing terabytes of data.
Gutierrez: If the prototype performs well, what happens next?
Lenaghan: If the prototype performs well on the scaling test, then we move
to the production phase. I would say that about 60 percent of the time we
involve engineering, and about 40 percent of the time we do it ourselves. If we
need something really performant and it is complicated and involves a lot of
configuration, then we always involve engineering there. Eventually there is a
process to migrate the prototype to production code. Engineering will push
our combined work to the dev ops group, which is where it is moved into
production. Then we monitor it and hopefully never touch it again.
Gutierrez: How do you do the scaling test?
Lenaghan: We slowly step up the scale of data we run through the prototype in two dimensions. We have the geospatial dimension, which is large,
but not extremely large. There we are talking about hundreds of millions of
entities, let's say, in the United States. We also have the second dimension,
which we think of as the movement side. This is the data coming from the
ad-request side. This data is on the order of tens of billions of data points per
month. We want to understand how well the prototype scales up in the two
dimensions—the spatial dimension and the movement dimension. Usually, we
start on the geospatial side and apply our analysis to just one metro area. For
various reasons, we always use San Francisco. We could use New York City,
but Manhattan is too anomalous.
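
[Editor's note: the following is a minimal illustrative sketch, not code from the interview. It shows one way the two-dimensional scaling test described above could be structured, stepping up the geospatial dimension and the movement (ad-request) dimension independently and timing each run. All helper names, region labels, and data sizes are hypothetical placeholders.]

    # Hypothetical sketch of a two-dimensional scaling test: step up the
    # geospatial dimension and the movement dimension, timing each combination.
    import time

    GEO_STEPS = ["san_francisco_metro", "california", "united_states"]  # spatial dimension
    MOVEMENT_STEPS_DAYS = [1, 7, 30]                                    # movement dimension

    def load_entities(region):
        """Placeholder: load geospatial entities for a region (assumed helper)."""
        return [f"{region}_entity_{i}" for i in range(1000)]

    def load_ad_requests(days):
        """Placeholder: load `days` worth of ad-request movement data (assumed helper)."""
        return ({"day": d, "point": p} for d in range(days) for p in range(10000))

    def run_prototype(entities, requests):
        """Placeholder for the prototype pipeline under test."""
        return sum(1 for _ in requests) + len(entities)

    def scaling_test():
        for region in GEO_STEPS:
            for days in MOVEMENT_STEPS_DAYS:
                start = time.time()
                processed = run_prototype(load_entities(region), load_ad_requests(days))
                print(f"{region:>20} | {days:>3} day(s) | {processed} records | "
                      f"{time.time() - start:.2f}s")

    if __name__ == "__main__":
        scaling_test()
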
Gutierrez: San Francisco is the base metro area for the spatial dimension
testing?
Lenaghan: Exactly. We set the initial geographic scale starting with the metro,
and then on the movement side, we will start with a day's worth of data. Then
we scale the data to a week's worth of data. Then we scale up the data to