Victor Hu - Data Scientists at Work


We had to go back and figure out how to deal with that, so it was very much
an iterative process. That is where a lot of the statistical testing came in,
and you can see that the fact that someone does not have a network connected
actually does provide a lot of information. Eventually, I had to code that
in: the presence of a network is one of the predictor variables. So that is one
interesting and kind of unusual aspect of the music data that we discovered
during the modeling process.
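
As an illustration of the idea Hu describes, here is a minimal sketch of coding the presence of a network as its own predictor variable. The source does not give the actual encoding, so everything here is an assumption: the field names `artist` and `followers`, the function `encode_network_features`, and the choice to fill a missing count with 0 are all hypothetical.

```python
def encode_network_features(records, key="followers"):
    """Turn an optional count into two predictors.

    Hypothetical encoding: `has_network` is 1 when a value is present and
    0 when it is missing, and the count itself is set to 0 when missing,
    so a model can tell "no network connected" apart from "network with
    zero followers".
    """
    rows = []
    for rec in records:
        value = rec.get(key)  # None if the key is absent or explicitly null
        rows.append({
            "artist": rec["artist"],
            "has_network": 0 if value is None else 1,
            key: 0 if value is None else value,
        })
    return rows


# Made-up example records, not data from the interview.
artists = [
    {"artist": "A", "followers": 1200},
    {"artist": "B", "followers": None},  # network field present but empty
    {"artist": "C"},                      # no network connected at all
]
features = encode_network_features(artists)
```

Keeping the indicator separate from the filled-in count means the missingness itself can carry information in the model, which is the point made above.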

Gutierrez: What kind of tools do you use in your data stack?

Hu: We are primarily an R shop in terms of the data analysis. Our full stack
is mostly in Java and PHP, though the modeling is done in R or Python. Our
data is stored in HBase, and then we pull it out with Pig and usually store it in
Mongo databases for events. A lot of times we will use SQL databases for time
series, just to make the data science easier. And then the visualization is done
with different tools. Sometimes we will use R, and other times we will use D3.js.
Actually, one of our big pushes right now is to do more D3.js visualizations.

What tools we use evolves very quickly. Just a couple of months ago, all of
our data was stored in Cassandra. We made the shift to HBase literally in this
last month or so. I have now been using Pig and Hive with our more Hadoop-oriented
data back end. I am sure next year we will be using something different
or the tools will have evolved into something different. So the speed at which
new technology is coming out is really astounding.

At a conference I recently went to, PrestoDB was one of the new technologies,
widely touted as an even faster version of Hive. We always struggle with
connecting R or Python with the Java back end and the PHP front end. So
there are different ways to do that based on different technologies that are
coming out. It's all about using what works, what you need at that moment in
time, and not necessarily worrying about two years down the road because
everything will have shifted by then.

Gutierrez: Given the fast pace of change, how do you think about hiring or
integrating someone new into the team?

Hu: Hiring data scientists is very exciting at this time because in some ways
there are no established guidelines on how to do it. People have skills in so
many different areas. I know when we were hiring our second data scientist
I had specific things that I was looking for. My philosophy at the time was to
hire someone who could do things that I could not do, or at least had a broad
range of knowledge in areas where I have very little, so that I and the entire team
could learn and benefit. Together we would be complementary pieces.

I think that is what we always strive for when we hire somebody, data
scientist or not: we look for people who are very intelligent and can learn
on the fly. I think that is a big component of data science today, because
nobody knows all the answers. There is not necessarily an established
