The first area handles data collection and storage for analytics. The other
area is responsible for responding to bid requests in real time, within
30 milliseconds. So there you have NoSQL technology: very high-speed
lookup tables, like Cassandra and other things.
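As a rough illustration only, a high-speed lookup on the bidding side might look like the Python sketch below against Cassandra. The contact points, keyspace, table, and column names are placeholders, not a description of the company's actual bidding system.

```python
# Minimal sketch of a low-latency profile lookup during bid handling.
# Keyspace, table, and column names are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2"])   # placeholder contact points
session = cluster.connect("bidding")           # hypothetical keyspace

# A prepared statement keeps per-request overhead low, which matters
# when the whole bid response has to fit in a ~30 ms budget.
lookup = session.prepare(
    "SELECT features FROM user_profiles WHERE user_id = ?"
)

def get_profile(user_id):
    """Single-partition read; returns None if the user is unknown."""
    row = session.execute(lookup, [user_id]).one()
    return row.features if row else None
```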
For my day-to-day, we have a Hadoop cluster. All of the incoming events are
put into a standard format and then stored. We have event logs for everything:
bid requests, impressions, clicks, conversions, all the visitation data, and
so on. I want everything logged. We record them as event logs, with certain
lookback times and fields.
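Purely as an illustration, a standardized event record of the kind described might look roughly like the sketch below; the actual field names and lookback conventions are not spelled out in the interview.

```python
# Illustrative shape of a standardized event record; field names and the
# lookback convention are assumptions, not the company's actual schema.
from dataclasses import dataclass, field

@dataclass
class Event:
    event_type: str      # "bid_request", "impression", "click", "conversion", "visit"
    timestamp: int       # epoch milliseconds
    user_id: str
    campaign_id: str
    lookback_days: int   # how far back history was considered for this event
    fields: dict = field(default_factory=dict)   # event-specific attributes

example = Event(
    event_type="click",
    timestamp=1_400_000_000_000,
    user_id="u123",
    campaign_id="c42",
    lookback_days=30,
    fields={"url": "example.com/product"},
)
```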
These event logs are housed in a Hadoop cluster, on top of which we have Apache Hive.
Hive is a tool that basically lets you query this data with more or less standard
SQL. It is not necessarily a real-time response. It is a little bit slow because of
the whole interaction with Hadoop, but I do not need real time. I just need to
get the data that I want. So I use Hive to get data out of Hadoop.
The key to working with this data is figuring out exactly which data sample you
need, and therefore which Hive query will give it to you. Typically,
I try to avoid going beyond 10 GB of data. For most things I need to do, I can
downsample significantly, as I do not need to process all the data. Once I know
which downsampled data set I need, it is just a matter of writing the right
query to pull out that piece.
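For concreteness, a downsampling pull of that kind could look roughly like the sketch below, here issued from Python through PyHive; the host, table, partition, and column names are made up for illustration.

```python
# Sketch of pulling a downsampled slice out of Hive via PyHive.
# Host, table, partition column, and field names are placeholders.
from pyhive import hive

QUERY = """
SELECT user_id, event_type, ts, features
FROM   event_log                           -- hypothetical event-log table
WHERE  dt BETWEEN '2014-01-01' AND '2014-01-07'
  AND  abs(hash(user_id)) % 100 = 0        -- deterministic ~1% user sample
"""

conn = hive.Connection(host="hive-gateway.example.com", port=10000)
cursor = conn.cursor()
cursor.execute(QUERY)

# Write rows out in batches so the local working set stays small.
with open("sample.tsv", "w") as out:
    rows = cursor.fetchmany(10_000)
    while rows:
        for row in rows:
            out.write("\t".join(str(col) for col in row) + "\n")
        rows = cursor.fetchmany(10_000)
```

Sampling on a hash of the user ID, rather than on random rows, keeps all events for a sampled user together, which is usually what you want for modeling.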
Gutierrez: What kind of tools do you use to preprocess the data?
Perlich: I use a lot of UNIX tools: sed, awk, sort, grep, and others. You name
it, I probably use it. I also write a lot of my own code in Perl. I do a lot of
scripting that runs over that data. The scripting is not so much for analytics;
rather, it preprocesses the data into a state where I can then run it
through some special-purpose tool.
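The scripts themselves are not shown in the interview, but a Python stand-in for this kind of preprocessing pass might look like the sketch below, turning raw tab-separated event rows into a sparse "label index:value" format for a downstream tool; the column layout and the feature-hashing step are assumptions.

```python
# Turn raw tab-separated event rows (read from stdin) into a sparse
# "label index:value ..." line format a modeling tool can consume.
# The input columns and the hashing trick are illustrative assumptions.
import sys
import zlib

NUM_FEATURES = 2 ** 24   # size of the hashed feature space

def feature_index(name, value):
    """Map a categorical (name, value) pair to a stable feature index."""
    return zlib.crc32(f"{name}={value}".encode()) % NUM_FEATURES

for line in sys.stdin:
    user_id, event_type, url, label = line.rstrip("\n").split("\t")
    indices = sorted({
        feature_index("event_type", event_type),
        feature_index("url", url),
    })
    print(label, " ".join(f"{i}:1" for i in indices))
```

Run in a pipeline, for example `python preprocess.py < sample.tsv > train.txt`, in much the same spirit as chaining sed and awk.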
Gutierrez: What types of special-purpose tools have you built?
Perlich: For the hard-core modeling that we do, we have our own implementation
of a stochastic gradient descent logistic regression. That thing takes
something along the lines of 10 million examples with 10 million features, and
within 5 to 10 minutes you get an answer. It is not parallelized, but it is really
to the point and implemented very well for our specific use case of dealing
with this kind of sparse data.
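The in-house implementation is not published, but the underlying technique is standard. A minimal, single-threaded sketch of stochastic gradient descent logistic regression over sparse rows looks roughly like this; it shows the logic only, not a tuned implementation.

```python
# Minimal SGD logistic regression on sparse data; not the company's code.
import math
import random

def sgd_logistic(rows, num_features, epochs=3, lr=0.05, l2=1e-6):
    """rows: list of (label, [(index, value), ...]) pairs, label in {0, 1}."""
    w = [0.0] * num_features
    for _ in range(epochs):
        random.shuffle(rows)
        for label, feats in rows:
            # Score only the nonzero features; with sparse data this is what
            # keeps a pass cheap even with millions of columns.
            z = sum(w[i] * v for i, v in feats)
            z = max(min(z, 35.0), -35.0)            # guard exp() overflow
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label                           # gradient of the log loss
            for i, v in feats:
                w[i] -= lr * (g * v + l2 * w[i])    # SGD step with L2 penalty
    return w
```

The throughput Perlich quotes, tens of millions of rows and features in a few minutes, comes from a carefully tuned implementation; this sketch only illustrates why touching just the nonzero entries makes that plausible.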
We are very much a self-made shop, so we are not using any kind of commercial
tooling; we build our own specialized solutions. When we need
something, I typically start digging around in the academic literature and say
something like, “Okay, let's see what SVMlight (a specific implementation
of a support vector machine algorithm out of Cornell) is doing.” I first check
on performance, and even if it takes three hours, that is fine. Maybe we try
random forest. I can get examples of this code, and then we see what works well.
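As a sketch of that kind of quick benchmark, here is what it might look like with scikit-learn stand-ins, using LinearSVC in place of the actual SVMlight binary and RandomForestClassifier for the random forest; the input file name and parameter choices are illustrative.

```python
# Quick baseline comparison on a downsampled training file.
# LinearSVC stands in for SVMlight here; file name and settings are assumptions.
from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical preprocessed sample in sparse svmlight format.
X, y = load_svmlight_file("train.txt")

candidates = {
    "linear_svm": LinearSVC(C=1.0),
    "random_forest": RandomForestClassifier(n_estimators=100, n_jobs=-1),
}

for name, model in candidates.items():
    # AUC over a quick cross-validation is enough to see what works well.
    scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```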
 