center facility. The colo will help with storing location data that is very sensitive. Technically, all of the data will be stored in Apache's Hadoop Distributed File System (HDFS).
Gutierrez: As your team expands, what types of people are you looking for
and how do you actually know that they are good?
Lenaghan: When we are looking for people, we are looking for very passionate people who are quantitatively minded. Even though we use Hadoop
a lot here, being an expert in Hadoop is not a job requirement. We want
people who can think logically, scientifically, and quantitatively about problems.
We want them to be able to accurately identify what works and does not
work. We also want them to know why things do not work, even though they
thought they were going to work. Being self-critical is important.
Our interview process consists more of probing to understand how they
think rather than, “How would you do this particular graph algorithm in a
map-reduce framework?” We are interested more in raw skills than in particular skills for our data science team. Whether we are making a junior or a senior hire, we are looking for that quantitative piece. We have hired people
on the junior level who have very little programming/software engineering
experience. They had to learn those skills on the job and now they are writing
fantastic code. So hiring based on raw ability rather than specific experience
has not been a problem at all. That said, we occasionally need a very specialized person for a very specialized task, but that is the exception to our usual
hiring practices.
Gutierrez: Are there any tools not currently in your workflow that you are
excited about?
Lenaghan: One of the technologies we are looking at is Julia. A member of the data science team is working on figuring out where we can use Julia in our workflow. Right now, because we are on
Amazon, we pay for the compute time. So we definitely want to cut down our
compute costs as much as possible. Once we move into the colo, it will be less
of a concern, but we still want to cut down our compute times.
We run many processes hundreds of billions of times a month. When you are
running algorithms on ad-request logs, even something as simple as converting from a latitude and longitude to a tile makes a big difference in compute
times and costs. Making these types of very small changes is important in our
work, so we are always looking for more performant numerical techniques.
Julia looks very promising in this area, so that is why we have a person working
on figuring out how to include it in our workflow.
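To make the latitude/longitude-to-tile example concrete, here is a minimal Python sketch using the standard Web Mercator "slippy map" tiling formula. The actual tile scheme in use is not described in the interview, so the function name and zoom parameter are illustrative assumptions, not production code.

```python
import math

def lat_lon_to_tile(lat_deg, lon_deg, zoom):
    """Convert a WGS84 latitude/longitude to Web Mercator tile coordinates.

    This is the standard "slippy map" tiling formula; it stands in for
    whatever proprietary tile scheme is actually used.
    """
    n = 2 ** zoom                                  # tiles per axis at this zoom
    x = int((lon_deg + 180.0) / 360.0 * n)         # linear in longitude
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# A single call is cheap, but when it runs hundreds of billions of times
# a month, trimming even a few operations translates into compute savings.
print(lat_lon_to_tile(40.7484, -73.9857, 12))      # -> (1206, 1539)
```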
I would also like to learn more about Clojure. I think the fewer lines of code
that you have to write, the better. Just looking at some Clojure projects, it
seems very promising to me. Functional programming languages lend themselves very well to things we do a great deal of, such as distributed computing.
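To illustrate that point, here is a minimal sketch, in Python rather than Clojure for consistency with the example above, of why functional style suits distributed work: a side-effect-free "map" step can run on any worker in any order, and an associative "reduce" step merges the partial results. The word-count task and all names here are hypothetical.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_words(chunk):
    # Pure function: the result depends only on the input chunk,
    # so chunks can be handed to workers in any order.
    return Counter(chunk.split())

def merge(left, right):
    # Associative merge: partial results can be combined
    # pairwise in any grouping (the "reduce" step).
    return left + right

if __name__ == "__main__":
    chunks = ["to be or not to be", "that is the question", "be that as it may"]
    with Pool() as pool:
        partials = pool.map(count_words, chunks)   # parallel "map" step
    print(reduce(merge, partials, Counter()))
```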
 