optimized to run as a series of MapReduce operations. This addresses the
difficulties of setup, of writing jobs, and of chaining together multiple
MapReduces mentioned previously.
Dremel: A distributed SQL query engine that can perform complex
queries over data stored on Colossus, GFS, or elsewhere.
The version 2.0 stack, built piecemeal on top of the version 1.0 stack
(Megastore, for instance, is built on top of Bigtable), addresses many of the
drawbacks of the previous version. Megastore, for example, allows services
to write from any datacenter and know that other readers will read the most
up-to-date version. Spanner is, in many ways, a successor to Megastore,
adding automatic planet-scale replication and data provenance
protection.
On the data processing side, batch processing and interactive analyses were
separated into two tools based on usage models: Flume and Dremel. Flume
enables users to easily chain together MapReduces and provides a simpler
programming model to perform batch operations over Big Data. Dremel, on
the other hand, makes it easy to ask questions about Big Data because you
can now run a SQL query over terabytes of data and get results back in a few
seconds. Dremel is the query engine that powers BigQuery; its architecture
is discussed in detail in Chapter 9, “Understanding Query Execution.”
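To make that concrete, a query of this kind can be issued from ordinary application code. The sketch below uses the google-cloud-bigquery Java client and a public sample table; both choices are illustrative assumptions for this example rather than details from the text. It runs a full-table aggregate and prints the top results.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class SimpleAggregateQuery {
  public static void main(String[] args) throws InterruptedException {
    // Uses application-default credentials; assumes a GCP project is configured.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // A full-table aggregate over a public sample dataset (hypothetical choice of table).
    String sql =
        "SELECT corpus, COUNT(*) AS lines "
            + "FROM `bigquery-public-data.samples.shakespeare` "
            + "GROUP BY corpus ORDER BY lines DESC LIMIT 5";

    TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
    for (FieldValueList row : result.iterateAll()) {
      System.out.printf("%s: %d%n",
          row.get("corpus").getStringValue(), row.get("lines").getLongValue());
    }
  }
}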
An interesting consequence of the version 2.0 stack is that it explicitly
rejects the notion that in order to use Big Data you need to solve your
problems in fundamentally different ways than you're used to. While
MapReduce required you to think about your computation in terms of Map
and Reduce phases, FlumeJava allows you to write code that looks like you
are operating over normal Java collections. Bigtable replication required
abandoning consistent writes, but Megastore adds a consistent coordination
layer on top. And while Bigtable had improved scalability by disallowing
queries, Dremel retrofits a traditional SQL query interface onto Big Data
structured storage.
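As a rough illustration of what that programming model looks like, the sketch below uses Apache Crunch, an open-source library modeled on the published FlumeJava design; the library choice and the input/output paths are assumptions for the example, not details from the text. The word-count pipeline reads like ordinary collection processing, and the planner compiles it into MapReduce jobs behind the scenes.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCountPipeline {
  public static void main(String[] args) {
    // A pipeline over "parallel collections"; the planner turns it into MapReduce jobs.
    Pipeline pipeline = new MRPipeline(WordCountPipeline.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Looks like ordinary collection processing: split each line into words...
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // ...then count occurrences; no explicit Map or Reduce phases are written.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}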
There are still rough edges around many of the Big Data 2.0 technologies:
things that you expect to be able to do but can't, things that are slow but
seem like they should be fast, and cases where they hold onto awkward
abstractions. However, as time goes on, the trend seems to be toward
smoothing those rough edges and making working with Big Data as
seamless as working with smaller data.