We had to go back and figure out how to deal with that, so it was very much
an iterative process. That is where a lot of the statistical testing came in,
and you can see that the fact that someone does not have a network connected
actually does provide a lot of information. Eventually, I had to code that
in—the presence of a network is one of the predictor variables. So that is one
interesting and kind of unusual aspect to the music data that we discovered
during the modeling process.
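As a rough illustration of coding "the presence of a network" as a predictor, here is a minimal Python sketch. The interview does not say how this was actually implemented, so the function name, the field names (`has_network`, `network_size`), and the sample records are all hypothetical. The idea is only that a missing value becomes its own 0/1 feature instead of being dropped.

```python
from typing import Optional


def encode_network_features(network_size: Optional[int]) -> dict:
    """Turn a possibly missing network size into two predictors.

    Hypothetical encoding: `has_network` records whether any network
    is connected at all, and `network_size` falls back to 0 when it is not.
    """
    has_network = network_size is not None
    return {
        "has_network": 1 if has_network else 0,
        "network_size": network_size if has_network else 0,
    }


# Made-up sample records, for illustration only.
artists = [
    {"name": "artist_a", "network_size": 1200},
    {"name": "artist_b", "network_size": None},  # no network connected
]

rows = [encode_network_features(a["network_size"]) for a in artists]
```

With this kind of encoding, a model can learn one effect for having no network and a separate effect for the size of a network that does exist.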
Gutierrez: What kind of tools do you use in your data stack?
Hu: We are primarily an R shop in terms of the data analysis. Our full stack
is mostly in Java and PHP, though the modeling is done in R or Python. Our
data is stored in HBase, and then we pull it out with Pig and store it usually in
Mongo databases for events. A lot of times we will do SQL databases for time
series, just to make the data science easier. And then the visualization is done
in different things. Sometimes we will use R, and other times we will use D3.js.
Actually, one of our big pushes right now is to do more D3.js visualizations.
What tools we use evolves very quickly. Just a couple of months ago, all of
our data was stored in Cassandra. We made the shift to HBase literally in this
last month or so. I have now been using Pig and Hive with our more
Hadoop-oriented data backend. I am sure next year we will be using something different
or the tools will have evolved into something different. So the speed at which
new technology is coming out is really astounding.
At a conference I recently went to, PrestoDB was one of the new technologies,
widely touted as an even faster version of Hive. We always struggle with
connecting R or Python with the Java back end and the PHP front end. So
there are different ways to do that based on different technologies that are
coming out. It's all about using what works, what you need at that moment in
time, and not necessarily worrying about two years down the road because
everything will have shifted by then.
Gutierrez: Given the fast pace of change, how do you think about hiring or
integrating someone new into the team?
Hu: Hiring data scientists is very exciting at this time because in some ways
there are no established guidelines on how to do it. People have skills in so
many different areas. I know when we were hiring our second data scientist
I had specific things that I was looking for. My philosophy at the time was to
hire someone who could do things that I could not do, or at least had a big
spectrum of knowledge that I have very little in, so that I and the entire team
could learn and benefit. Together we would be complementary pieces.
I think that is what we always strive for when we hire somebody, data
scientists or not—we look for people who are very intelligent and can learn
on the fly. I think that is a big component of data science today, because
nobody knows all the answers. There is not necessarily an established
 