a month's worth. At each step we are testing to see how the prototype is
performing. Depending on the project or the product, varying lengths of his-
tory are required for further testing. Interestingly, it is definitely the case that
more data is not always better. It depends on the product.
Gutierrez: The testing depends on the product, data history, and what you
are modeling. If it is a prototype for the December holiday season, you do not
want to use data from the middle of the summer.
Lenaghan: That's exactly right.
Gutierrez: Are there any interesting aspects of the data sets outside of the
most obvious information content?
Lenaghan: It turns out that a lot of the biases in the data arise from the
fact that all of the movement data comes from smartphones. This means you
are completely biased toward people who own smartphones. This is a large
population, as there are about 110 million smartphones in the USA right now.
Although this represents a large swath of the US population, it is still a biased
sample. So we have to deal with that bias in the data.
Gutierrez: Are there other large biases that you need to take into
account?
Lenaghan: The movement histories that we see also have a large bias, as
these phones don't drop 5-minute breadcrumbs all the time. They are only
engaged when someone is using an ad-supported app, for example—so you
also have a bias there, which means you end up biasing toward people who
use ad-supported apps. In fact, biases pop up for different ad-supported apps
people use all the time, such as texting apps, Words with Friends, or other
apps. So free texting apps tend to skew in one direction. Words with Friends-
type apps—even my mom uses Words with Friends—tend to skew in another
direction. In interpreting our data, we have to correct for these and many
other sorts of biases all the time.
Gutierrez: Let's dig a little deeper into the biases. Did you and your col-
leagues figure them out, or are these biases industry-known demographic,
sociographic, and/or psychographic heuristics?
Lenaghan: That is something we have figured out internally. Something
we're always very cognizant of is that we don't want to be an undifferenti-
ated black-box machine learning platform. So a very large component of the
bias-correcting work we do is based on social anthropology. We look at the
movement data and ask people in our office with a background in anthropology
or sociology to help us gain further understanding.
We really want to understand: “How do we interpret this data? It's biased
in this way. Why is that?” A great deal of the time the data is not going to
answer these questions. A key thing is to never underestimate the power of
domain-specific knowledge.
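[Editor's note: the interview does not spell out how these corrections are made. Purely as an illustration, one standard way to correct this kind of sampling bias is post-stratification weighting, in which each observation is reweighted so that the sample's demographic mix matches the known population mix. The sketch below uses hypothetical age-group shares and a made-up observations table; none of the numbers, column names, or groupings come from Lenaghan's team.]

# Illustrative sketch only (not the method described in the interview):
# post-stratification weighting to correct a sample whose demographic mix
# differs from the population's. All shares and observations are made up.
import pandas as pd

# Hypothetical share of each age group in the overall population
# versus its share among users of some ad-supported app.
population_share = pd.Series({"18-24": 0.13, "25-44": 0.34, "45-64": 0.33, "65+": 0.20})
sample_share = pd.Series({"18-24": 0.30, "25-44": 0.45, "45-64": 0.20, "65+": 0.05})

# Weight each stratum so the reweighted sample matches the population mix.
weights = population_share / sample_share

# Hypothetical movement observations, each tagged with the user's age group.
observations = pd.DataFrame({
    "age_group": ["18-24", "25-44", "45-64", "25-44", "65+"],
    "visits": [3, 1, 2, 4, 1],
})

# Apply the stratum weight before aggregating: over-represented groups count
# for less, under-represented groups count for more.
observations["weight"] = observations["age_group"].map(weights)
weighted_mean = (observations["visits"] * observations["weight"]).sum() / observations["weight"].sum()
print(f"Weighted mean visits per user: {weighted_mean:.2f}")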
 