encounter people running Hadoop and they are excited to tell me that they
now have all of their data in HDFS. I ask them how much data they have and
if it is structured. I'm always amazed when they tell me that it's a few gigs of
structured data. That size and type of data could fit into a tiny free SQLite
database. This tells me that they encountered a very good salesperson and
they haven't actually thought through the problem they are solving.
If you do it this way, which is backward, it's a lot like most people's New Year's
resolution for getting healthy and losing weight. It's January 1st, and I go get a
gym membership and buy a bunch of workout gear and new clothes. What
have I done? Nothing. I'm just as fat as I've always been, but I feel like I'm making
progress because I've spent money and bought things. That's how I see the
businesses that go out and procure tools. They say to themselves, “We've got
to do big data and we've got to do data science, so let's go get tools and get
consultants, and then we'll be ready to go.” And before they know it, all they
have to show for it is a bunch of money spent, a bunch of tooling, and maybe
an infographic, because they never took the time to do the one thing that's
very hard to show progress on, which is thinking. They never sat down and
thought through: What problems should we be attacking? What data do we
have, and how should we attack these problems given the data that we have?
Instead, they went out and spent their budget, because that's a great way to
show you're doing something. You're spending money. Something must be
happening. Everyone's waiting for someone else to make something happen
while they spend the money.
We're different and very conservative in the sense that the way I think about
tools is problem-focused. We start with the problem we want to solve or a
general understanding of several problems we want to solve. Then we take
stock of the data that's available to us. We think about the techniques that
are available to us. We think about the technologies that are available to us.
And then, and only then, do we select the technologies that are going to solve
those problems.
For instance, on some of the AI models we built for compliance, there are
some really sexy tools that we could have used. However, what we realized
is that all the data we cared about for these models was already structured.
And because it was already structured, it already worked well within an SQL
context. Furthermore, a lot of the queries we needed to run for the training
sets were queries that were best accomplished via SQL window functions, as
we were looking at a lot of lagged time-series data. So once we hit that point,
we realized that it would fit fine in a sharded PostgreSQL database, as the
data was probably smaller than 30 terabytes. Having realized this, we asked
ourselves: why would we need something else? This is a tool that's very robust
and stable. It's a tool that the devs know how to work with really well. They
can spin it up fast. Our compliance team needs our models yesterday. Why
would I choose to go after tools that are a little bit less stable but sexier and
 