Caitlin Smallwood - Data Scientists at Work

At Netflix, it's quite the opposite, from the standpoint that we are very intentional about what data we capture and how we capture it. We try to embed any critical business logic that we know we're going to want to apply, no matter what we do with the data, at the point of capture. This makes just about everything much easier. Sometimes we get that wrong, and more work falls to the engineering teams when that's the case, because then they have to go back, detangle what we did, and rework something new, which can be painful. But you're going to have that no matter what at some layer of the data stack, and we have found that it's easier to do that at the source where you can. Now, you can't always do that, because you want granular data that you can aggregate in as many ways as you want to later when you think of new ideas. But there are certain things you know that you really don't need. We try to weed those out as aggressively as possible, but you're never one hundred percent right. You're occasionally going to have to go back and rework things. You just want to try to minimize that.

Gutierrez: How do you think about the technology selection for the data stack?

Smallwood: This is a hard one because technology, especially in the data space, evolves more quickly than most companies can evolve. This is true especially at the data warehousing level, whether it's in the cloud or dedicated warehousing. There are so many different broad mechanisms, and once you've built a lot of infrastructure within your company, it's incredibly expensive to switch over to some new technology.

We use Teradata for a large part of our data warehousing. If we wanted to move from Teradata to some other data-center-oriented warehousing system, we would have so much to move that it would be a year's worth of work for the entire data organization. Perhaps not quite that much, but it would be a lot of work. So the farther upstream you are in your stack, the harder it is, I think, to change technologies.

That said, we have been, like many companies, moving more and more toward cloud-based analytics. When I first started at Netflix, pretty much all of our data was in Teradata and we had a little bit of data in the cloud. Netflix was just moving toward serving our whole product off of the cloud, so as you can imagine, that meant we started having more and more data in the cloud. With this evolution of the product occurring, what has worked for us is doing parallel development. We've moved from having the majority of our analytics in Teradata to now having the majority of our analytics in the cloud.

We do still have a formidable amount of data in Teradata, but we've switched our philosophy. We have aggregate data we use for ongoing reporting in Teradata. We have granular data we use more for the modeling in the cloud, and all sorts of analytics go in both places. But we have more data in the cloud, as it's closer to the point of capture. And it's worked really nicely for us to
