Overall, the project was a series of steps, with problems and solutions along
the way. The problems tended to get smaller as we made progress, but at the
end of the day it was still a very large problem to solve. This is a project I
am proud of, as I built most of it, and it is something we now run on a daily
basis, although at first it took a long time to run.
Gutierrez: Did it take a long time to run because it was a prototype?
Lenaghan: It was written in Java because we knew that it would be some-
thing that would have to be very performant. So we built the prototype to
show that it worked and scaled. Once we showed that it worked even though
it took a long time to run, we handed that over to engineering, because there
is a lot of configuration that is involved in that as well. At that point, because
it was not as performant as it could have been, we had one of our young
rock-star engineers make it very fast and efficient.
Gutierrez: How is the data stored?
Lenaghan: This is a very important two-pronged question for PlaceIQ. The
first prong and priority is to store the very sensitive location data in a way
that maintains as much privacy for people as possible. The last thing we want
to do is have a scandal. When we talk about this large join between location
history and our geospatial layer, we never actually store the device IDs. Even
though the device IDs are already obfuscated and hashed when we use them,
we are super careful to never actually store them.
When ingesting data, we get the location and device ID from ad-request logs.
However, once we join it against our base data layer, we drop the location. So
it is stored in the format of obfuscated device ID, context, and timestamp. So it
will be device123/Walmart/Wednesday, December 17, 3 P.M. Note that in this
format we do not specify which Walmart it is, just that it is a Walmart. We
never store any information about which Walmart it was; so we do not know
if the Walmart is a San Francisco-area Walmart, a New York-area Walmart,
or a Walmart somewhere else.
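As an illustration only, and not PlaceIQ's actual code, the transformation described above could be sketched in Python roughly as follows. The names `AdRequest`, `ContextRecord`, `BASE_LAYER`, and `to_context_record` are hypothetical, the geospatial base layer is stubbed with a small dictionary keyed on rounded coordinates, and the year in the usage note is an assumption (the interview gives only "Wednesday, December 17, 3 P.M.").

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AdRequest:
    """Hypothetical raw log entry: obfuscated device ID plus a precise location."""
    device_id: str
    lat: float
    lon: float
    timestamp: datetime


@dataclass(frozen=True)
class ContextRecord:
    """Hypothetical stored record: the kind of place, but no coordinates."""
    device_id: str
    context: str
    timestamp: datetime


# Stand-in for the geospatial base layer (an assumption for this sketch):
# maps a coarse (lat, lon) cell to a place category.
BASE_LAYER = {
    (37.78, -122.41): "Walmart",
    (40.75, -73.99): "Walmart",
}


def to_context_record(req: AdRequest) -> Optional[ContextRecord]:
    """Join a request against the base layer, then drop the location."""
    cell = (round(req.lat, 2), round(req.lon, 2))
    context = BASE_LAYER.get(cell)
    if context is None:
        return None  # no known place at this location; nothing is stored
    # Only device ID, context, and timestamp survive; lat/lon are discarded.
    return ContextRecord(req.device_id, context, req.timestamp)
```

Calling `to_context_record(AdRequest("device123", 37.7812, -122.4108, datetime(2014, 12, 17, 15, 0)))` yields a record whose context is `"Walmart"` and which carries no coordinates. A request from the second cell in the stub produces the same context, so the two stores cannot be told apart from the stored data.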
We are always very careful with any of our derived data that we never store
any type of identifier (device ID, IP address, or similar data) alongside any
sort of raw location data. We keep a very strict information wall between those data
sets. So our data is stored as device ID and the context in which the device
was, but not exactly where the device was. Our rules are built out specifically
so that we only query on context and times.
Gutierrez: And the second prong?
Lenaghan: The second prong is technical in nature because of the size of
data we are using. So it is important to us to think about how to store, retrieve,
and analyze this data. Right now, our entire infrastructure is hosted on
Amazon's S3 service. Within a month, we will have moved to a colocation data