Databases Reference
In-Depth Information
Processing data spread across a cluster of horizontally scaled machines is complex. The MapReduce
model possibly provides one of the best possible methods to process large-scale data on a horizontal
cluster of machines.
Defi nition and Introduction
MapReduce is a parallel programming model that allows distributed processing on large data sets
on a cluster of computers. The MapReduce framework is patented ( http://patft.uspto.gov/
netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.
htm&r=1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331 ) by Google, but the
ideas are freely shared and adopted in a number of open-source implementations.
MapReduce derives its ideas and inspiration from concepts in the world of functional programming.
Map and reduce are commonly used functions in the world of functional programming. In functional
programming, a map function applies an operation or a function to each element in a list. For
example, a multiply-by-two function on a list [1, 2, 3, 4] would generate another list as follows:
[2, 4, 6, 8]. When such functions are applied, the original list is not altered. Functional programming
believes in keeping data immutable and avoids sharing data among multiple processes or threads.
This means the map function that was just illustrated, trivial as it may be, could be run via two or
more multiple threads on the list and these threads would not step on each other, because the list
itself is not altered.
Like the map function, functional programming has a concept of a reduce function. Actually, a
reduce function in functional programming is more commonly known as a fold function. A reduce
or a fold function is also sometimes called an accumulate, compress, or inject function. A reduce or
fold function applies a function on all elements of a data structure, such as a list, and produces a
single result or output. So applying a reduce function-like summation on the list generated out of the
map function, that is, [2, 4, 6, 8], would generate an output equal to 20.
So map and reduce functions could be used in conjunction to process lists of data, where a function
is fi rst applied to each member of a list and then an aggregate function is applied to the transformed
and generated list.
This same simple idea of map and reduce has been extended to work on large data sets. The idea
is slightly modifi ed to work on collections of tuples or key/value pairs. The map function applies
a function on every key/value pair in the collection and generates a new collection. Then the reduce
function works on the new generated collection and applies an aggregate function to compute a fi nal
output. This is better understood through an example, so let me present a trivial one to explain the
fl ow. Say you have a collection of key/value pairs as follows:
[{ “94303”: “Tom”}, {“94303”: “Jane”}, {“94301”: “Arun”}, {“94302”: “Chen”}]
This is a collection of key/value pairs where the key is the zip code and the value is the name of a
person who resides within that zip code. A simple map function on this collection could get the
names of all those who reside in a particular zip code. The output of such a map function is as
follows:
[{“94303”:[“Tom”, “Jane”]}, {“94301”:[“Arun”]}, {“94302”:[“Chen”]}]
Search WWH ::




Custom Search