As anyone who has spent time tuning a relational database can attest, there
is a lot of black magic involved in getting queries to run quickly on
your-favorite-database. You may need to add indexes, stripe data across
disks, put the transaction log on its own spindle, and so on. However, as
your data grows, at some point it gets harder and harder to make your
queries perform well. In addition, the more work you do, the more you end
up specializing the schema for the type of questions you typically ask of your
data.
What if you want to ask a question you've never asked before? If you are
relying on a heavily tuned schema, or if you're running different queries
than the database was tuned for, you may not get answers in a reasonable
amount of time or without bogging down your production database. In these
cases, your options are limited: you can either run an extremely slow
query (which may degrade performance for your entire database) or export
the data and process it in an external system like Hadoop.
Often, to get queries to run quickly, people sample their data—they keep
only 10 percent of user impressions, for example. But what happens if you
want to explore the data in a way that requires access to all the impressions?
Maybe you want to compute the number of distinct users that visited your
site—if you drop 90 percent of your data, you can't just multiply the
remaining users by 10 to get the number of distinct users in the original
dataset. This point is somewhat subtle, but if you drop 90 percent of your
data, you might still have records representing 99 percent of your users, or
you might have records representing only 5 percent of your users; you can't
tell unless you use a more sophisticated way to filter your data.
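The point about skew can be made concrete with a small simulation. The sketch below builds a hypothetical impression log (user names, counts, and the 10 percent slice are all illustrative assumptions, not data from any real system) in which a few power users generate most impressions while many casual users appear only once. Multiplying the distinct users in the sample by 10 then badly misestimates the true distinct count:

```python
import random

random.seed(42)

# Hypothetical impression log: 100 "power users" generate 90,000
# impressions between them, while 10,000 casual users appear once each.
power_users = [f"user_{i}" for i in range(100)]
casual_users = [f"casual_{i}" for i in range(10_000)]
impressions = (
    [random.choice(power_users) for _ in range(90_000)]
    + casual_users
)

true_distinct = len(set(impressions))  # 10,100 distinct users

# Keep 10 percent of impressions (every tenth record).
sample = impressions[::10]
sampled_distinct = len(set(sample))

# The sample still contains essentially all 100 power users but only
# about 10 percent of the casual users, so scaling by 10 overcounts.
naive_estimate = sampled_distinct * 10
print(true_distinct, sampled_distinct, naive_estimate)
```

With this skew the naive estimate lands near 11,000 against a true count of 10,100; reverse the skew and it would undercount instead, which is exactly why sampled data cannot answer distinct-count questions without more sophisticated filtering.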
How Can You Read a Terabyte in a Second?
If you want to ask interactive questions of your Big Data, you must process
all your data within a few seconds. That means you need to read hundreds
of gigabytes per second—and ideally more.
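The arithmetic behind that claim is worth spelling out. The figures below are rough, illustrative assumptions (roughly 100 MB/s sustained read throughput for a single spinning disk), not measurements from any particular hardware:

```python
# Back-of-the-envelope arithmetic for scanning a terabyte interactively.
TB = 10**12  # one terabyte, in bytes
DISK_THROUGHPUT = 100 * 10**6  # assumed bytes/second for a single disk

# Sequentially scanning 1 TB on one disk takes about 10,000 seconds,
# close to three hours -- nowhere near interactive.
seconds_on_one_disk = TB / DISK_THROUGHPUT

# To finish the same scan in one second you would need on the order of
# 10,000 disks reading in parallel.
disks_for_one_second = TB // DISK_THROUGHPUT

print(seconds_on_one_disk, disks_for_one_second)  # 10000.0 10000
```

This is why the options below all amount to either reading less data or reading it with massive parallelism.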
Following are three ways that you can achieve this type of data rate:
1. Skip a lot of the data. This is a good option if you know in advance the
types of questions you're going to ask. You can pre-aggregate the data or
create indexes on the columns that you need to access. However, if you
want to ask different questions, or ask them in a different way, you may
not be able to avoid reading everything.
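A minimal sketch of the pre-aggregation trade-off, using a hypothetical impression log (the field names and records are invented for illustration): a rollup keyed on the questions you expect makes those queries cheap, but a new question forces you back to the raw data.

```python
from collections import Counter

# Hypothetical raw impression log: (date, page, user) records.
raw_log = [
    ("2024-01-01", "/home", "alice"),
    ("2024-01-01", "/home", "bob"),
    ("2024-01-01", "/about", "alice"),
    ("2024-01-02", "/home", "carol"),
]

# Pre-aggregate once: impression counts per (date, page). Queries that
# only need these totals read a handful of rollup rows instead of
# scanning every raw record.
rollup = Counter((date, page) for date, page, _user in raw_log)

print(rollup[("2024-01-01", "/home")])  # 2

# But a question the rollup wasn't built for -- say, distinct users per
# page -- cannot be answered from it; that query must scan the raw log.
distinct_users = len({user for _date, page, user in raw_log
                      if page == "/home"})
print(distinct_users)  # 3
```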