The Story of Big Data at Google - Google BigQuery Analytics

• MapReduce can be slow. If you want to ask questions of your data, you have to wait minutes or hours to get the answers. Moreover, you have to write custom C++ or Java code each time you want to change the question that you're asking.

• GFS, while improving the durability of the data (since it is replicated multiple times), can suffer from reduced availability, since the metadata server is a single point of failure.

• Bigtable has problems in a multidatacenter environment. Most services run in multiple locations, but Bigtable replication between datacenters is only eventually consistent (meaning that data that gets written will show up everywhere, but not immediately). Individual services spend a lot of redundant effort babysitting the replication process.

• Programmers (even Google programmers) have a really difficult time dealing with eventual consistency. This same problem occurred when Intel engineers tried improving CPU performance by relaxing the memory model to be eventually consistent; it caused lots of subtle bugs because the hardware no longer behaved the way people's mental model of it said it should.
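The eventual-consistency behavior described in the list above can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions: the names `Replica`, `EventuallyConsistentStore`, and `replicate` are hypothetical and do not correspond to any Bigtable API. A write is applied to one datacenter right away and reaches the others only when a separate replication step runs, so a read in another location can return stale data in between.

```python
class Replica:
    """One datacenter's copy of a key-value table (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class EventuallyConsistentStore:
    """Toy model: writes hit one replica now and the rest only on replicate()."""

    def __init__(self, names):
        self.replicas = {name: Replica(name) for name in names}
        self.pending = []  # (key, value, origin) updates not yet shipped elsewhere

    def write(self, origin, key, value):
        # The write is visible at the origin datacenter immediately.
        self.replicas[origin].data[key] = value
        self.pending.append((key, value, origin))

    def read(self, where, key):
        # Reads are served from the local replica only.
        return self.replicas[where].data.get(key)

    def replicate(self):
        # Ship every queued update to all replicas other than its origin.
        for key, value, origin in self.pending:
            for name, replica in self.replicas.items():
                if name != origin:
                    replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore(["us", "eu"])
store.write("us", "balance", 100)
stale = store.read("eu", "balance")  # the update has not reached "eu" yet
store.replicate()
fresh = store.read("eu", "balance")  # the replicas have now converged
```

Here `stale` is `None` while `fresh` is `100`. Application code that assumes a read always sees the latest write has to handle that gap itself, which is the kind of redundant effort the list describes.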

Over the next several years, Google built a number of additional infrastructure components that refined the ideas from the 1.0 stack:

• Colossus: A distributed filesystem that works around many of the limitations in GFS. Unlike many of the other technologies used at Google, Colossus' architecture hasn't been publicly disclosed in research papers.

• Megastore: A geographically replicated, consistent NoSQL-type datastore. Megastore uses the Paxos algorithm to ensure consistent reads and writes. This means that if you write data in one datacenter, it is immediately available in all other datacenters.

• Spanner: A globally replicated datastore that can handle data locality constraints, like “This data is allowed to reside only in European datacenters.” Spanner managed to solve the problem of global time ordering in a geographically distributed system by using atomic clocks (together with GPS receivers) to guarantee synchronization to within a known bound.

• FlumeJava: A system that allows you to write idiomatic Java code that runs over collections of Big Data. Flume operations get compiled and
