This metric is useful for three reasons. First, it is consistent over time. Second,
it is error-independent for the rankings over time for the relevant artists we
are concerned with. And third, the data is available, both historically and
going forward, though it did take a lot of wrangling to get. Once we have the
data and the success metric, we can then make
a prediction. All of the feature selection that went into the modeling was also
a unique challenge in itself. But the key that really set us off on that path was
being able to define what we wanted to predict.
Gutierrez: What work did you have to do before you were able to create
the model?
Hu: Leading up to the project, we had done a lot of work producing the
backend, loading all of our time-series data into a database that can be queried
for who the top artists are in terms of plays, in terms of growths of plays, in
terms of totals across the networks, and other similar queries. So you can
quickly pull up the top ten artists, the top million artists, ranked in order,
rather than having to go artist by artist, which is how our data is stored.
Once we had that in place and we had the framework for what we wanted to
predict, then you could get to the juicy stuff.
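The kind of ranking query Hu describes can be sketched in a few lines. The example below uses an in-memory SQLite table with a hypothetical schema (artist, network, day, plays); the interview does not specify the real backend, schema, or data, so all of those are assumptions for illustration.

```python
import sqlite3

# Hypothetical schema: one row per artist, per network, per day.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plays (artist TEXT, network TEXT, day TEXT, plays INTEGER)"
)
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?, ?)",
    [
        ("Artist A", "net1", "2014-01-01", 100),
        ("Artist A", "net2", "2014-01-01", 50),
        ("Artist B", "net1", "2014-01-01", 120),
        ("Artist C", "net1", "2014-01-01", 30),
    ],
)

# Rank artists by total plays across all networks, highest first,
# instead of walking the raw per-artist records one by one.
top = conn.execute(
    "SELECT artist, SUM(plays) AS total FROM plays "
    "GROUP BY artist ORDER BY total DESC"
).fetchall()
print(top)  # [('Artist A', 150), ('Artist B', 120), ('Artist C', 30)]
```

The same GROUP BY / ORDER BY shape works for the other rankings mentioned (growth of plays, per-network totals) by changing the aggregated column.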
I want to emphasize how much setup is involved in getting to the point where
you can actually do the modeling. You have to think about what question
you want to answer, as well as what question you can answer with the data.
So many people, I think, neglect to consider how long that takes, and how
much industry-specific knowledge, as well as knowledge of your own data, it
requires. So that was an important lesson.
Gutierrez: Once you arrived at the modeling stage, what was the process
like?
Hu: The modeling was definitely an iterative process. We started off with
throwing theoretical models at it, and quickly realized that there were a lot
of things we had not accounted for in the initial thinking. For example, most
artists do not have all the social media networks set up and connected. So
you get this unusual data artifact: for each row of data about an artist, you
only have a couple of metrics, and which metrics you have varies across the
whole universe of artists.
Further, it is a little bit unclear whether that is systematic, whether it is
indicative of anything, or whether it simply means the artist has not joined that
network yet, which is why they have no data there. So that was definitely an unusual
aspect of the data. I realized it when I ran the model, and all of a sudden, all
of these artists who did not have certain networks connected were showing
up really low—like Kanye West did not have Facebook or a similar network
connected, so his predictions were really low, and that obviously did not make
any sense.
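The artifact Hu describes is the classic missing-features problem: a metric can be absent because the artist never connected that network, not because their audience there is zero. One common workaround, sketched below with made-up rows (this is a hypothetical illustration, not necessarily what Hu's team did), is to fill the gap with a neutral value while adding an explicit missing-data indicator, so the model can distinguish "not connected" from "connected but low".

```python
# Hypothetical per-artist rows; None marks a network that is not connected.
rows = [
    {"artist": "Artist A", "facebook": 1000.0, "twitter": 500.0},
    {"artist": "Artist B", "facebook": None, "twitter": 800.0},  # no Facebook
]

FEATURES = ["facebook", "twitter"]

def featurize(row):
    """Fill missing metrics with 0.0 but add a was-missing flag per feature,
    so a disconnected network is not mistaken for a low-play-count one."""
    out = {}
    for f in FEATURES:
        value = row.get(f)
        out[f] = value if value is not None else 0.0
        out[f + "_missing"] = 1.0 if value is None else 0.0
    return out

print(featurize(rows[1]))
# {'facebook': 0.0, 'facebook_missing': 1.0, 'twitter': 800.0, 'twitter_missing': 0.0}
```

Without the indicator columns, a model sees a disconnected network as a zero and pushes the prediction down, which is exactly the Kanye West failure mode described above.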
 