being good general-purpose solutions. They are expensive to build and maintain and out of reach
for most organizations.
Grid computing emerged as a possible solution for a problem that supercomputers didn't solve.
The idea behind a computational grid is to distribute work among a set of nodes and thereby
reduce the computational time that a single large machine takes to complete the same task. In grid
computing, the focus is on compute-intensive tasks where data passing between nodes is managed
using Message Passing Interface (MPI) or one of its variants. This topology works well where the
extra CPU cycles get the job done faster. However, this same topology becomes inefficient if a large amount of data needs to be passed among the nodes. Large data transfers between nodes are constrained by I/O and bandwidth limits, which often become the bottleneck. In addition, the onus of managing the data-sharing logic and recovering from failed states rests completely on the developer.
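To make the data-passing model concrete, the following is a minimal sketch of explicit point-to-point message passing with MPI. It uses the Python mpi4py bindings, which are an assumption on my part; the text does not prescribe any particular MPI implementation or language, and the script name is hypothetical. The point it illustrates is that the developer, not the framework, is responsible for moving data between nodes.

```python
# Minimal MPI sketch (assumes the mpi4py package and an MPI runtime are installed).
# Run with, for example: mpiexec -n 2 python mpi_pass.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Node 0 produces a chunk of data and explicitly ships it to node 1.
    data = {"chunk_id": 1, "values": list(range(1000))}
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    # Node 1 blocks until the data arrives; every such transfer crosses the network,
    # which is where the I/O and bandwidth limitations mentioned above bite.
    data = comm.recv(source=0, tag=11)
    print("node 1 received", len(data["values"]), "values")
```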
Public computing projects like SETI@Home (http://setiathome.berkeley.edu/) and Folding@Home (http://folding.stanford.edu/) extend the idea of grid computing to individuals
donating “spare” CPU cycles for compute-intensive tasks. These projects run on idle CPU time
of hundreds of thousands, sometimes millions, of individual machines, donated by volunteers.
These individual machines go on and off the Internet and provide a large compute cluster despite
their individual unreliability. By pooling idle CPUs, the overall infrastructure tends to behave like a single supercomputer, and often a smarter one.
Despite the availability of varied solutions for effective distributed computing, none of those listed so far keeps data local to the compute nodes to minimize bandwidth bottlenecks, and few follow a policy of sharing little or nothing among the participating nodes. MapReduce is inspired by functional programming notions that favor minimal interdependence among parallel processes, or threads, and is committed to keeping data and computation together. Developed for distributed computing and patented by Google, MapReduce has become one of the most popular ways of processing large volumes of data efficiently and reliably. It offers a simple and fault-tolerant model for effective computation on large data sets spread across a horizontal cluster of commodity nodes. This chapter explains MapReduce and explores the many possible computations on big data using this programming model.
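As a minimal illustration of the model (a sketch of the classic word-count computation, not one of the chapter's own examples, which follow shortly), the work splits into a map function that emits intermediate key-value pairs from each input split and a reduce function that folds the values gathered for each key. The single-process shuffle shown here stands in for what a real framework performs across the cluster.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (word, 1) pair for every word in the input split.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce: fold all intermediate values for a single key into one result.
    return word, sum(counts)

def run_mapreduce(documents):
    # A single-process stand-in for the shuffle/group-by-key step that a
    # distributed MapReduce framework performs between the two phases.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Because the map and reduce functions share nothing and operate only on their own inputs, the framework is free to run them on whichever nodes already hold the data, which is precisely the property that keeps data and computation together.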
The term is written here as MapReduce, the camel-cased form used and
popularized by Google. However, the coverage here is more generic and not
restricted to Google's definition.
The idea behind MapReduce was published in a research paper, which is accessible
online at http://labs.google.com/papers/mapreduce.html (Dean, Jeffrey &
Ghemawat, Sanjay (2004), “MapReduce: Simplified Data Processing on Large
Clusters”).
UNDERSTANDING MAPREDUCE
Chapter 6 introduced MapReduce as a way to group data on MongoDB clusters. Therefore, MapReduce
isn't a complete stranger to you. However, to explain the nuances and idioms of MapReduce, I
reintroduce the concept using a few illustrated examples.