Jonathan Lenaghan - Data Scientists at Work


Overall, the project was a number of steps with problems and solutions along the way. The problems tended to get smaller and smaller and smaller as we made progress, but at the end of the day it was still a very large problem to solve. This is a project I am proud of, as I built most of it and this is something we now run on a daily basis, although at first it took a long time to run.

Gutierrez: Did it take a long time to run because it was a prototype?

Lenaghan: It was written in Java because we knew that it would be something that would have to be very performant. So we built the prototype to show that it worked and scaled. Once we showed that it worked, even though it took a long time to run, we handed that over to engineering, because there is a lot of configuration that is involved in that as well. At that point, because it was not as performant as it could have been, we had one of our young rock-star engineers make it very fast and efficient.

Gutierrez: How is the data stored?

Lenaghan: This is a very important two-pronged question for PlaceIQ. The first prong and priority is to store the very sensitive location data in a way that maintains as much privacy for people as possible. The last thing we want to do is have a scandal. When we talk about this large join between location history and our geospatial layer, we never actually store the device IDs. Even though the device IDs are already obfuscated and hashed when we use them, we are super careful to never actually store them.

When ingesting data, we get the location and device ID from ad-request logs. However, once we join it against our base data layer, we drop the location. So it is stored in the format of obfuscated device ID, context, and timestamp. So it will be device123/Walmart/Wednesday, December 17, 3 P.M. Note that in this format we do not specify which Walmart it is, just that it is a Walmart. We never store any information about which Walmart it was; so we do not know if the Walmart is a San Francisco-area Walmart, a New York-area Walmart, or a Walmart somewhere else.
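The join-then-drop step described above can be illustrated with a minimal Python sketch. It is not PlaceIQ's implementation (the interview says only that the prototype was written in Java). The grid-cell lookup table `BASE_LAYER`, the `ContextRecord` type, and the function names are all hypothetical, chosen only to show how a record can keep the context while discarding the coordinates.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical base data layer: a coarse grid cell (latitude and longitude
# in hundredths of a degree) mapped to a context label. A real geospatial
# layer would be far richer; this table exists only for illustration.
BASE_LAYER = {
    (3777, -12242): "Walmart",   # a San Francisco-area cell
    (4071, -7401): "Walmart",    # a New York-area cell
}


@dataclass(frozen=True)
class ContextRecord:
    """What gets stored: no latitude, longitude, or store identifier."""
    device: str          # device ID, assumed to arrive already obfuscated
    context: str         # e.g. "Walmart", with no hint of which one
    timestamp: datetime


def lookup_context(lat: float, lon: float) -> Optional[str]:
    """Join a raw location against the (hypothetical) base data layer."""
    cell = (round(lat * 100), round(lon * 100))
    return BASE_LAYER.get(cell)


def to_context_record(
    device: str, lat: float, lon: float, ts: datetime
) -> Optional[ContextRecord]:
    """Resolve the context, then drop the location before storing."""
    context = lookup_context(lat, lon)
    if context is None:
        return None  # no known context for this cell; nothing is stored
    # lat and lon go out of scope here and never reach the stored record.
    return ContextRecord(device=device, context=context, timestamp=ts)
```

In this sketch, two ad requests from the same device at the same time, one near San Francisco and one near New York, produce identical records, so the stored data cannot tell the two stores apart.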

We are always very careful with any of our derived data that we never store any type of identifier (device ID, IP address, or similar data) together with any sort of raw location data. We keep a very strict information wall between those data sets. So our data is stored as device ID and the context in which the device was, but not exactly where the device was. Our rules are built out specifically so that we only query on context and times.

Gutierrez: And the second prong?

Lenaghan: The second prong is technical in nature because of the size of data we are using. So it is important to us to think about how to store, retrieve, and analyze this data. Right now, our entire infrastructure is hosted on Amazon's S3 service. Within a month, we will have moved to a colocation data
