John Foreman - Data Scientists at Work

encounter people running Hadoop and they are excited to tell me that they

now have all of their data in HDFS. I ask them how much data they have and

if it is structured. I'm always amazed when they tell me that it's a few gigs of

structured data. That size and type of data could fit into a tiny free SQLite

database. This tells me that they encountered a very good salesperson and

they haven't actually thought through the problem they are solving.

If you do it this way, which is backward, it's a lot like most people's New Year's

resolution for getting healthy and losing weight. It's January 1st, and I go get a

gym membership and buy a bunch of workout gear and new clothes. What

have I done? Nothing. I'm just as fat as I've always been, but I feel like I'm

making progress because I've spent money and bought things. That's how I see the

businesses that go out and procure tools. They say to themselves, “We've got

to do big data and we've got to do data science, so let's go get tools and get

consultants, and then we'll be ready to go.” And before they know it, all they

have to show for it is a bunch of money spent, a bunch of tooling, and maybe

an infographic, because they never took the time to do the one thing that's

very hard to show progress on, which is thinking. They never sat down and

thought through: What problems should we be attacking? What data do we

have, and how should we attack these problems given the data that we have?

Instead, they went out and spent their budget, because that's a great way to

show you're doing something. You're spending money. Something must be

happening. Everyone's waiting for someone else to make something happen

while they spend the money.

We're different and very conservative in the sense that the way I think about

tools is problem-focused. We start with the problem we want to solve or a

general understanding of several problems we want to solve. Then we take

stock of the data that's available to us. We think about the techniques that

are available to us. We think about the technologies that are available to us.

And then, and only then, do we select the technologies that are going to solve

those problems.

For instance, on some of the AI models we built for compliance, there are

some really sexy tools that we could have used. However, what we realized

is that all the data we cared about for these models was already structured.

And because it was already structured, it already worked well within an SQL

context. Furthermore, a lot of the queries we needed to run for the training

sets were queries that were best accomplished via SQL window functions, as

we were looking at a lot of lagged time-series data. So once we hit that point,

we realized that it would fit fine in a sharded PostgreSQL database, as the

data was probably smaller than 30 terabytes. Having realized this, we asked

ourselves, why would we need something else? This is a tool that's very robust

and stable. It's a tool that the devs know how to work with really well. They

can spin it up fast. Our compliance team needs our models yesterday. Why

would I choose to go after tools that are a little bit less stable but sexier and
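
To make the point about SQL window functions and lagged time-series data concrete, here is a minimal sketch of building a lagged feature with LAG. It is not the actual compliance query, which the interview does not show. The table daily_activity, its columns, and the sample rows are hypothetical, and an in-memory SQLite database (version 3.25 or later, for window function support) stands in for PostgreSQL so the example runs with only the Python standard library. The same LAG ... OVER (PARTITION BY ... ORDER BY ...) syntax is also valid in PostgreSQL.

```python
import sqlite3

# Hypothetical schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_activity (account_id INTEGER, day TEXT, sends INTEGER)"
)
conn.executemany(
    "INSERT INTO daily_activity VALUES (?, ?, ?)",
    [
        (1, "2014-01-01", 100),
        (1, "2014-01-02", 150),
        (1, "2014-01-03", 90),
        (2, "2014-01-01", 10),
        (2, "2014-01-02", 5000),
    ],
)

# For each account, pair every day's value with the previous day's value.
# LAG returns NULL on the first row of each partition (no earlier day exists).
LAGGED_FEATURES_SQL = """
SELECT
    account_id,
    day,
    sends,
    LAG(sends, 1) OVER (PARTITION BY account_id ORDER BY day) AS sends_prev_day,
    sends - LAG(sends, 1) OVER (PARTITION BY account_id ORDER BY day) AS sends_change
FROM daily_activity
ORDER BY account_id, day
"""

rows = conn.execute(LAGGED_FEATURES_SQL).fetchall()
for row in rows:
    print(row)
```

Each output row carries the current value alongside the lagged one, which is the shape a training set built from lagged time-series data typically needs. Doing this inside the database avoids exporting the data to a separate tool just to compute the lags.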
