being good general-purpose solutions. They are expensive to build and maintain and out of reach
for most organizations.
Grid computing emerged as a possible solution for a problem that supercomputers didn't solve.
The idea behind a computational grid is to distribute work among a set of nodes and thereby
reduce the computational time that a single large machine takes to complete the same task. In grid
computing, the focus is on compute-intensive tasks where data passing between nodes is managed
using Message Passing Interface (MPI) or one of its variants. This topology works well where the
extra CPU cycles get the job done faster. However, this same topology becomes inefficient if a large amount of data needs to be passed among the nodes. Large data transfers between nodes are constrained by I/O and bandwidth limits, which often become the bottleneck. In addition, the onus of managing the data-sharing logic and recovering from failed states rests completely on the developer.
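To make the data-passing model concrete, the following is a minimal sketch of explicit point-to-point message passing with MPI. It uses the Python mpi4py bindings, which are an assumption on my part; the text does not prescribe any particular MPI implementation or language, and the script name is hypothetical. The point it illustrates is that the developer, not the framework, is responsible for moving data between nodes.

```python
# Minimal MPI sketch (assumes the mpi4py package and an MPI runtime are installed).
# Run with, for example: mpiexec -n 2 python mpi_pass.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Node 0 produces a chunk of data and explicitly ships it to node 1.
    data = {"chunk_id": 1, "values": list(range(1000))}
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    # Node 1 blocks until the data arrives; every such transfer crosses the network,
    # which is where the I/O and bandwidth limitations mentioned above bite.
    data = comm.recv(source=0, tag=11)
    print("node 1 received", len(data["values"]), "values")
```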
Public computing projects like SETI@Home (http://setiathome.berkeley.edu/) and Folding@Home (http://folding.stanford.edu/) extend the idea of grid computing to individuals
donating “spare” CPU cycles for compute-intensive tasks. These projects run on idle CPU time
of hundreds of thousands, sometimes millions, of individual machines, donated by volunteers.
These individual machines go on and off the Internet and provide a large compute cluster despite
their individual unreliability. By pooling idle CPUs, the overall infrastructure tends to behave like a single supercomputer, and often a smarter one.
Despite the availability of varied solutions for effective distributed computing, none of those listed so far keeps data local to the compute nodes to minimize bandwidth bottlenecks, and few follow a policy of sharing little or nothing among the participating nodes. MapReduce is inspired by functional programming notions that favor minimal interdependence among parallel processes, or threads, and is committed to keeping data and computation together. Developed for distributed computing and patented by Google, MapReduce has become one of the most popular ways of processing large volumes of data efficiently and reliably. It offers a simple and fault-tolerant model for effective computation on large data sets spread across a horizontal cluster of commodity nodes. This chapter explains MapReduce and explores the many possible computations on big data using this programming model.
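As a minimal illustration of the model (a sketch of the classic word-count computation, not one of the chapter's own examples, which follow shortly), the work splits into a map function that emits intermediate key-value pairs from each input split and a reduce function that folds the values gathered for each key. The single-process shuffle shown here stands in for what a real framework performs across the cluster.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (word, 1) pair for every word in the input split.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce: fold all intermediate values for a single key into one result.
    return word, sum(counts)

def run_mapreduce(documents):
    # A single-process stand-in for the shuffle/group-by-key step that a
    # distributed MapReduce framework performs between the two phases.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Because the map and reduce functions share nothing and operate only on their own inputs, the framework is free to run them on whichever nodes already hold the data, which is precisely the property that keeps data and computation together.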
The term is written here as MapReduce, the camel-cased form used and
popularized by Google. However, the coverage here is more generic and not
restricted to Google's definition.
The idea behind MapReduce was published in a research paper, which is accessible
online at http://labs.google.com/papers/mapreduce.html (Dean, Jeffrey &
Ghemawat, Sanjay (2004), “MapReduce: Simplified Data Processing on Large
Clusters”).
UNDERSTANDING MAPREDUCE
Chapter 6 introduced MapReduce as a way to group data on MongoDB clusters. Therefore, MapReduce
isn't a complete stranger to you. However, to explain the nuances and idioms of MapReduce, I
reintroduce the concept using a few illustrated examples.