The first area handles data collection and storage for analytics. The other
area is responsible for responding to bid requests in real time, within
30 milliseconds. So there you have NoSQL technology: very high-speed
lookup tables, like Cassandra and other things.
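As a rough illustration only, a high-speed lookup on the bidding side might look like the Python sketch below against Cassandra. The contact points, keyspace, table, and column names are placeholders, not a description of the company's actual bidding system.

```python
# Minimal sketch of a low-latency profile lookup during bid handling.
# Keyspace, table, and column names are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2"])   # placeholder contact points
session = cluster.connect("bidding")           # hypothetical keyspace

# A prepared statement keeps per-request overhead low, which matters
# when the whole bid response has to fit in a ~30 ms budget.
lookup = session.prepare(
    "SELECT features FROM user_profiles WHERE user_id = ?"
)

def get_profile(user_id):
    """Single-partition read; returns None if the user is unknown."""
    row = session.execute(lookup, [user_id]).one()
    return row.features if row else None
```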
For my day-to-day, we have a Hadoop cluster. All of the incoming events are
put into a standard format and then stored. We have event logs for everything:
bid requests, impressions, clicks, conversions, all the visitation data, and
so on. I want everything logged. We record them as event logs, with certain
lookback times and fields.
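Purely as an illustration, a standardized event record of the kind described might look roughly like the sketch below; the actual field names and lookback conventions are not spelled out in the interview.

```python
# Illustrative shape of a standardized event record; field names and the
# lookback convention are assumptions, not the company's actual schema.
from dataclasses import dataclass, field

@dataclass
class Event:
    event_type: str      # "bid_request", "impression", "click", "conversion", "visit"
    timestamp: int       # epoch milliseconds
    user_id: str
    campaign_id: str
    lookback_days: int   # how far back history was considered for this event
    fields: dict = field(default_factory=dict)   # event-specific attributes

example = Event(
    event_type="click",
    timestamp=1_400_000_000_000,
    user_id="u123",
    campaign_id="c42",
    lookback_days=30,
    fields={"url": "example.com/product"},
)
```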
These event logs are housed in a Hadoop cluster, on top of which we have Apache Hive.
Hive is a tool that basically lets you query this data with more or less standard
SQL. It is not necessarily a real-time response. It is a little bit slow because of
the whole interaction with Hadoop, but I do not need real time. I just need to
get the data that I want. So I use Hive to get data out of Hadoop.
The key to working with this data is figuring out exactly which data sample you
need, and therefore which Hive query will give it to you. Typically,
I try to avoid going beyond 10 GB of data. For most things I need to do, I can
downsample significantly, as I do not need to process all the data. Once I know
which downsampled data set I need, it is just a matter of writing the right
query to pull out that piece.
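For concreteness, a downsampling pull of that kind could look roughly like the sketch below, here issued from Python through PyHive; the host, table, partition, and column names are made up for illustration.

```python
# Sketch of pulling a downsampled slice out of Hive via PyHive.
# Host, table, partition column, and field names are placeholders.
from pyhive import hive

QUERY = """
SELECT user_id, event_type, ts, features
FROM   event_log                           -- hypothetical event-log table
WHERE  dt BETWEEN '2014-01-01' AND '2014-01-07'
  AND  abs(hash(user_id)) % 100 = 0        -- deterministic ~1% user sample
"""

conn = hive.Connection(host="hive-gateway.example.com", port=10000)
cursor = conn.cursor()
cursor.execute(QUERY)

# Write rows out in batches so the local working set stays small.
with open("sample.tsv", "w") as out:
    rows = cursor.fetchmany(10_000)
    while rows:
        for row in rows:
            out.write("\t".join(str(col) for col in row) + "\n")
        rows = cursor.fetchmany(10_000)
```

Sampling on a hash of the user ID, rather than on random rows, keeps all events for a sampled user together, which is usually what you want for modeling.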
Gutierrez: What kind of tools do you use to preprocess the data?
Perlich: I use a lot of UNIX tools: sed, awk, sort, grep, and others. You name
it, I probably use it. I also write a lot of my own code in Perl. I do a lot of
scripting that runs over that data. The scripting is not so much for analytics;
rather, it preprocesses the data into a state where I can then run it
through some special-purpose tool.
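The scripts themselves are not shown in the interview, but a Python stand-in for this kind of preprocessing pass might look like the sketch below, turning raw tab-separated event rows into a sparse "label index:value" format for a downstream tool; the column layout and the feature-hashing step are assumptions.

```python
# Turn raw tab-separated event rows (read from stdin) into a sparse
# "label index:value ..." line format a modeling tool can consume.
# The input columns and the hashing trick are illustrative assumptions.
import sys
import zlib

NUM_FEATURES = 2 ** 24   # size of the hashed feature space

def feature_index(name, value):
    """Map a categorical (name, value) pair to a stable feature index."""
    return zlib.crc32(f"{name}={value}".encode()) % NUM_FEATURES

for line in sys.stdin:
    user_id, event_type, url, label = line.rstrip("\n").split("\t")
    indices = sorted({
        feature_index("event_type", event_type),
        feature_index("url", url),
    })
    print(label, " ".join(f"{i}:1" for i in indices))
```

Run in a pipeline, for example `python preprocess.py < sample.tsv > train.txt`, in much the same spirit as chaining sed and awk.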
Gutierrez: What types of special-purpose tools have you built?
Perlich: For the hard-core modeling that we do, we have our own implementation
of a stochastic gradient descent logistic regression. That thing takes
something along the lines of 10 million examples with 10 million features, and
within 5 to 10 minutes you get an answer. It is not parallelized, but it is really
to the point and implemented very well for our specific use case of dealing
with this kind of sparse data.
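The in-house implementation is not published, but the underlying technique is standard. A minimal, single-threaded sketch of stochastic gradient descent logistic regression over sparse rows looks roughly like this; it shows the logic only, not a tuned implementation.

```python
# Minimal SGD logistic regression on sparse data; not the company's code.
import math
import random

def sgd_logistic(rows, num_features, epochs=3, lr=0.05, l2=1e-6):
    """rows: list of (label, [(index, value), ...]) pairs, label in {0, 1}."""
    w = [0.0] * num_features
    for _ in range(epochs):
        random.shuffle(rows)
        for label, feats in rows:
            # Score only the nonzero features; with sparse data this is what
            # keeps a pass cheap even with millions of columns.
            z = sum(w[i] * v for i, v in feats)
            z = max(min(z, 35.0), -35.0)            # guard exp() overflow
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label                           # gradient of the log loss
            for i, v in feats:
                w[i] -= lr * (g * v + l2 * w[i])    # SGD step with L2 penalty
    return w
```

The throughput Perlich quotes, tens of millions of rows and features in a few minutes, comes from a carefully tuned implementation; this sketch only illustrates why touching just the nonzero entries makes that plausible.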
We are very much a self-made shop, so we are not using any kind of commercial
tooling; we build our own specialized solutions. When we need
something, I typically start digging around in the academic literature and say
something like, “Okay, let's see what SVMlight (a specific implementation
of a support vector machine algorithm out of Cornell) is doing.” I first check
on performance, and even if it takes three hours, that is fine. Maybe we try
random forest. I can get examples of this code, and then we see what works well.
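As a sketch of that kind of quick benchmark, here is what it might look like with scikit-learn stand-ins, using LinearSVC in place of the actual SVMlight binary and RandomForestClassifier for the random forest; the input file name and parameter choices are illustrative.

```python
# Quick baseline comparison on a downsampled training file.
# LinearSVC stands in for SVMlight here; file name and settings are assumptions.
from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical preprocessed sample in sparse svmlight format.
X, y = load_svmlight_file("train.txt")

candidates = {
    "linear_svm": LinearSVC(C=1.0),
    "random_forest": RandomForestClassifier(n_estimators=100, n_jobs=-1),
}

for name, model in candidates.items():
    # AUC over a quick cross-validation is enough to see what works well.
    scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```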
 