a month's worth. At each step we are testing to see how the prototype is
performing. Depending on the project or the product, varying lengths of his-
tory are required for further testing. Interestingly, it is definitely the case that
more data is not always better. It depends on the product.
Gutierrez: The testing depends on the product, data history, and what you
are modeling. If it is a prototype for the December holiday season, you do not
want to use data from the middle of the summer.
Lenaghan: That's exactly right.
Gutierrez: Are there any interesting aspects of the data sets outside of the
most obvious information content?
Lenaghan: It turns out that a lot of the biases in the data arise from the
fact that all of the movement data comes from smartphones. This means you
are completely biased toward people who own smartphones. This is a large
population, as there are about 110 million smartphones in the USA right now.
Although this represents a large swath of the US population, it is still a biased
sample. So we have to deal with that bias in the data.
Gutierrez: Are there other large biases that you need to take into
account?
Lenaghan: The movement histories that we see also have a large bias, as
these phones don't drop 5-minute breadcrumbs all the time. They are only
engaged when someone is using an ad-supported app, for example—so you
also have a bias there, which means you end up biasing toward people who
use ad-supported apps. In fact, biases pop up for different ad-supported apps
people use all the time, such as texting apps, Words with Friends, or other
apps. So free texting apps tend to skew in one direction. Words with Friends-
type apps—even my mom uses Words with Friends—tend to skew in another
direction. In interpreting our data, we have to correct for these and many
other sorts of biases all the time.
Gutierrez: Let's dig a little deeper into the biases. Did you and your col-
leagues figure them out, or are these biases industry-known demographic,
sociographic, and/or psychographic heuristics?
Lenaghan: That is something we have figured out internally. Something
we're always very cognizant of is that we don't want to be an undifferenti-
ated black-box machine learning platform. So a very large component of the
bias-correcting work we do is based on social anthropology. We look at the
movement data and ask people in our office with a background in anthropology
or sociology to help us gain further understanding.
We really want to understand: “How do we interpret this data? It's biased
in this way. Why is that?” A great deal of the time the data is not going to
answer these questions. A key thing is to never underestimate the power of
domain-specific knowledge.
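[Editor's note: the interview does not spell out how these corrections are made. Purely as an illustration, one standard way to correct this kind of sampling bias is post-stratification weighting, in which each observation is reweighted so that the sample's demographic mix matches the known population mix. The sketch below uses hypothetical age-group shares and a made-up observations table; none of the numbers, column names, or groupings come from Lenaghan's team.]

# Illustrative sketch only (not the method described in the interview):
# post-stratification weighting to correct a sample whose demographic mix
# differs from the population's. All shares and observations are made up.
import pandas as pd

# Hypothetical share of each age group in the overall population
# versus its share among users of some ad-supported app.
population_share = pd.Series({"18-24": 0.13, "25-44": 0.34, "45-64": 0.33, "65+": 0.20})
sample_share = pd.Series({"18-24": 0.30, "25-44": 0.45, "45-64": 0.20, "65+": 0.05})

# Weight each stratum so the reweighted sample matches the population mix.
weights = population_share / sample_share

# Hypothetical movement observations, each tagged with the user's age group.
observations = pd.DataFrame({
    "age_group": ["18-24", "25-44", "45-64", "25-44", "65+"],
    "visits": [3, 1, 2, 4, 1],
})

# Apply the stratum weight before aggregating: over-represented groups count
# for less, under-represented groups count for more.
observations["weight"] = observations["age_group"].map(weights)
weighted_mean = (observations["visits"] * observations["weight"]).sum() / observations["weight"].sum()
print(f"Weighted mean visits per user: {weighted_mean:.2f}")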
 