As anyone who has spent time tuning a relational database can attest, there
is a lot of black magic involved in getting queries to run quickly on
your-favorite-database. You may need to add indexes, stripe data across
disks, put the transaction log on its own spindle, and so on. However, as
your data grows, at some point it gets harder and harder to make your
queries perform well. In addition, the more work you do, the more you end
up specializing the schema for the type of questions you typically ask of your
data.
What if you want to ask a question you've never asked before? If you are
relying on a heavily tuned schema, or if you're running different queries
than the database was tuned for, you may not get answers in a reasonable
amount of time or without bogging down your production database. In these
cases, your options are limited: you can either run an extremely slow
query (which may degrade performance for your entire database) or export
the data and process it in an external system like Hadoop.
Often, to get queries to run quickly, people sample their data—they keep
only 10 percent of user impressions, for example. But what happens if you
want to explore the data in a way that requires access to all the impressions?
Maybe you want to compute the number of distinct users that visited your
site—if you drop 90 percent of your data, you can't just multiply the
remaining users by 10 to get the number of distinct users in the original
dataset. This point is somewhat subtle, but if you drop 90 percent of your
data, you might still have records representing 99 percent of your users, or
you might have records representing only 5 percent of your users; you can't
tell unless you use a more sophisticated way to filter your data.
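The point about skew can be made concrete with a small simulation. The sketch below builds a hypothetical impression log (user names, counts, and the 10 percent slice are all illustrative assumptions, not data from any real system) in which a few power users generate most impressions while many casual users appear only once. Multiplying the distinct users in the sample by 10 then badly misestimates the true distinct count:

```python
import random

random.seed(42)

# Hypothetical impression log: 100 "power users" generate 90,000
# impressions between them, while 10,000 casual users appear once each.
power_users = [f"user_{i}" for i in range(100)]
casual_users = [f"casual_{i}" for i in range(10_000)]
impressions = (
    [random.choice(power_users) for _ in range(90_000)]
    + casual_users
)

true_distinct = len(set(impressions))  # 10,100 distinct users

# Keep 10 percent of impressions (every tenth record).
sample = impressions[::10]
sampled_distinct = len(set(sample))

# The sample still contains essentially all 100 power users but only
# about 10 percent of the casual users, so scaling by 10 overcounts.
naive_estimate = sampled_distinct * 10
print(true_distinct, sampled_distinct, naive_estimate)
```

With this skew the naive estimate lands near 11,000 against a true count of 10,100; reverse the skew and it would undercount instead, which is exactly why sampled data cannot answer distinct-count questions without more sophisticated filtering.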
How Can You Read a Terabyte in a Second?
If you want to ask interactive questions of your Big Data, you must process
all your data within a few seconds. That means you need to read hundreds
of gigabytes per second—and ideally more.
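The arithmetic behind that claim is worth spelling out. The figures below are rough, illustrative assumptions (roughly 100 MB/s sustained read throughput for a single spinning disk), not measurements from any particular hardware:

```python
# Back-of-the-envelope arithmetic for scanning a terabyte interactively.
TB = 10**12  # one terabyte, in bytes
DISK_THROUGHPUT = 100 * 10**6  # assumed bytes/second for a single disk

# Sequentially scanning 1 TB on one disk takes about 10,000 seconds,
# close to three hours -- nowhere near interactive.
seconds_on_one_disk = TB / DISK_THROUGHPUT

# To finish the same scan in one second you would need on the order of
# 10,000 disks reading in parallel.
disks_for_one_second = TB // DISK_THROUGHPUT

print(seconds_on_one_disk, disks_for_one_second)  # 10000.0 10000
```

This is why the options below all amount to either reading less data or reading it with massive parallelism.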
Following are three ways that you can achieve this type of data rate:
1. Skip a lot of the data. This is a good option if you know in advance the
types of questions you're going to ask. You can pre-aggregate the data or
create indexes on the columns that you need to access. However, if you
want to ask different questions, or ask them in a different way, you may
not be able to avoid reading everything.
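A minimal sketch of the pre-aggregation trade-off, using a hypothetical impression log (the field names and records are invented for illustration): a rollup keyed on the questions you expect makes those queries cheap, but a new question forces you back to the raw data.

```python
from collections import Counter

# Hypothetical raw impression log: (date, page, user) records.
raw_log = [
    ("2024-01-01", "/home", "alice"),
    ("2024-01-01", "/home", "bob"),
    ("2024-01-01", "/about", "alice"),
    ("2024-01-02", "/home", "carol"),
]

# Pre-aggregate once: impression counts per (date, page). Queries that
# only need these totals read a handful of rollup rows instead of
# scanning every raw record.
rollup = Counter((date, page) for date, page, _user in raw_log)

print(rollup[("2024-01-01", "/home")])  # 2

# But a question the rollup wasn't built for -- say, distinct users per
# page -- cannot be answered from it; that query must scan the raw log.
distinct_users = len({user for _date, page, user in raw_log
                      if page == "/home"})
print(distinct_users)  # 3
```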