optimized to run as a series of MapReduce operations. This addresses the
difficulties of setup, of writing jobs, and of chaining together multiple
MapReduces mentioned previously.
Dremel: A distributed SQL query engine that can perform complex
queries over data stored on Colossus, GFS, or elsewhere.
The version 2.0 stack, built piecemeal on top of the version 1.0 stack
(Megastore, for instance, is built on top of Bigtable), addresses many of the
drawbacks of the previous version. Megastore, for example, allows services
to write from any datacenter and know that other readers will read the most
up-to-date version. Spanner is, in many ways, a successor to Megastore,
adding automatic planet-scale replication and data provenance
protection.
On the data processing side, batch processing and interactive analyses were
separated into two tools based on usage models: Flume and Dremel. Flume
enables users to easily chain together MapReduces and provides a simpler
programming model to perform batch operations over Big Data. Dremel, on
the other hand, makes it easy to ask questions about Big Data because you
can now run a SQL query over terabytes of data and get results back in a few
seconds. Dremel is the query engine that powers BigQuery; its architecture
is discussed in detail in Chapter 9, “Understanding Query Execution.”
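To make that concrete, a query of this kind can be issued from ordinary application code. The sketch below uses the google-cloud-bigquery Java client and a public sample table; both choices are illustrative assumptions for this example rather than details from the text. It runs a full-table aggregate and prints the top results.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class SimpleAggregateQuery {
  public static void main(String[] args) throws InterruptedException {
    // Uses application-default credentials; assumes a GCP project is configured.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // A full-table aggregate over a public sample dataset (hypothetical choice of table).
    String sql =
        "SELECT corpus, COUNT(*) AS lines "
            + "FROM `bigquery-public-data.samples.shakespeare` "
            + "GROUP BY corpus ORDER BY lines DESC LIMIT 5";

    TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
    for (FieldValueList row : result.iterateAll()) {
      System.out.printf("%s: %d%n",
          row.get("corpus").getStringValue(), row.get("lines").getLongValue());
    }
  }
}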
An interesting consequence of the version 2.0 stack is that it explicitly
rejects the notion that in order to use Big Data you need to solve your
problems in fundamentally different ways than you're used to. While
MapReduce required you to think about your computation in terms of Map
and Reduce phases, FlumeJava allows you to write code that looks like you
are operating over normal Java collections. Bigtable replication required
abandoning consistent writes, but Megastore adds a consistent coordination
layer on top. And while Bigtable had improved scalability by disallowing
queries, Dremel retrofits a traditional SQL query interface onto Big Data
structured storage.
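As a rough illustration of what that programming model looks like, the sketch below uses Apache Crunch, an open-source library modeled on the published FlumeJava design; the library choice and the input/output paths are assumptions for the example, not details from the text. The word-count pipeline reads like ordinary collection processing, and the planner compiles it into MapReduce jobs behind the scenes.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCountPipeline {
  public static void main(String[] args) {
    // A pipeline over "parallel collections"; the planner turns it into MapReduce jobs.
    Pipeline pipeline = new MRPipeline(WordCountPipeline.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Looks like ordinary collection processing: split each line into words...
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // ...then count occurrences; no explicit Map or Reduce phases are written.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}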
There are still rough edges around many of the Big Data 2.0 technologies:
things that you expect to be able to do but can't, things that are slow but
seem like they should be fast, and cases where they hold onto awkward
abstractions. However, as time goes on, the trend seems to be toward
smoothing those rough edges and making working with Big Data as
seamless as working with smaller data.