This metric is useful for three reasons. First, it is consistent over time. Second,
it is error-independent for the rankings over time for the relevant artists we
are concerned with. And third, the data is available, both historically and
going forward, though it did take a lot of wrangling to get. Once we have the
data and the success metric, we can then make
a prediction. All of the feature selection that went into the modeling was also
a unique challenge in itself. But the key that really set us off on that path was
being able to define what we wanted to predict.
Gutierrez: What work did you have to do before you were able to create
the model?
Hu: Leading up to the project, we had done a lot of work producing the
backend, loading all of our time-series data into a database that can be queried
for who the top artists are in terms of plays, in terms of growths of plays, in
terms of totals across the networks, and other similar queries. So you can
quickly pull up the top ten artists, the top million artists, ranked in order,
rather than having to go artist by artist, which is how our data is stored.
Once we had that in place and we had the framework for what we wanted to
predict, then you could get to the juicy stuff.
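The kind of ranking query Hu describes can be sketched in a few lines. The example below uses an in-memory SQLite table with a hypothetical schema (artist, network, day, plays); the interview does not specify the real backend, schema, or data, so all of those are assumptions for illustration.

```python
import sqlite3

# Hypothetical schema: one row per artist, per network, per day.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plays (artist TEXT, network TEXT, day TEXT, plays INTEGER)"
)
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?, ?)",
    [
        ("Artist A", "net1", "2014-01-01", 100),
        ("Artist A", "net2", "2014-01-01", 50),
        ("Artist B", "net1", "2014-01-01", 120),
        ("Artist C", "net1", "2014-01-01", 30),
    ],
)

# Rank artists by total plays across all networks, highest first,
# instead of walking the raw per-artist records one by one.
top = conn.execute(
    "SELECT artist, SUM(plays) AS total FROM plays "
    "GROUP BY artist ORDER BY total DESC"
).fetchall()
print(top)  # [('Artist A', 150), ('Artist B', 120), ('Artist C', 30)]
```

The same GROUP BY / ORDER BY shape works for the other rankings mentioned (growth of plays, per-network totals) by changing the aggregated column.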
I want to emphasize how much setup is involved in getting to the point where
you can actually do the modeling. You have to think about what question
you want to answer, as well as what question you can answer with the data.
So many people, I think, neglect to consider how long that takes, and how
much industry-specific knowledge, as well as knowledge of your own data, it
requires. So that was an important lesson.
Gutierrez: Once you arrived at the modeling stage, what was the process
like?
Hu: The modeling was definitely an iterative process. We started off with
throwing theoretical models at it, and quickly realized that there were a lot
of things we had not accounted for in the initial thinking. For example, most
artists do not have all the social media networks set up and connected. So
you get this unusual data artifact: for each row of data about an artist, you
only have a couple of metrics, and which metrics you have varies across the
whole universe of artists.
Further, it is a little bit unclear whether that is systematic, whether it is
indicative of anything, or whether it simply means the artist has not joined that
network yet, which is why they have no data there. So that was definitely an unusual
aspect of the data. I realized it when I ran the model, and all of a sudden, all
of these artists who did not have certain networks connected were showing
up really low—like Kanye West did not have Facebook or a similar network
connected, so his predictions were really low, and that obviously did not make
any sense.
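The artifact Hu describes is the classic missing-features problem: a metric can be absent because the artist never connected that network, not because their audience there is zero. One common workaround, sketched below with made-up rows (this is a hypothetical illustration, not necessarily what Hu's team did), is to fill the gap with a neutral value while adding an explicit missing-data indicator, so the model can distinguish "not connected" from "connected but low".

```python
# Hypothetical per-artist rows; None marks a network that is not connected.
rows = [
    {"artist": "Artist A", "facebook": 1000.0, "twitter": 500.0},
    {"artist": "Artist B", "facebook": None, "twitter": 800.0},  # no Facebook
]

FEATURES = ["facebook", "twitter"]

def featurize(row):
    """Fill missing metrics with 0.0 but add a was-missing flag per feature,
    so a disconnected network is not mistaken for a low-play-count one."""
    out = {}
    for f in FEATURES:
        value = row.get(f)
        out[f] = value if value is not None else 0.0
        out[f + "_missing"] = 1.0 if value is None else 0.0
    return out

print(featurize(rows[1]))
# {'facebook': 0.0, 'facebook_missing': 1.0, 'twitter': 800.0, 'twitter_missing': 0.0}
```

Without the indicator columns, a model sees a disconnected network as a zero and pushes the prediction down, which is exactly the Kanye West failure mode described above.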
 