NoSQL: What It Is and Why You Need It - Professional NoSQL

Databases Reference

In-Depth Information

Processing data spread across a cluster of horizontally scaled machines is complex. The MapReduce

model possibly provides one of the best possible methods to process large-scale data on a horizontal

cluster of machines.

Defi nition and Introduction

MapReduce is a parallel programming model that allows distributed processing on large data sets

on a cluster of computers. The MapReduce framework is patented ( http://patft.uspto.gov/

netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.

htm&r=1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331 ) by Google, but the

ideas are freely shared and adopted in a number of open-source implementations.

MapReduce derives its ideas and inspiration from concepts in the world of functional programming.

Map and reduce are commonly used functions in the world of functional programming. In functional

programming, a map function applies an operation or a function to each element in a list. For

example, a multiply-by-two function on a list [1, 2, 3, 4] would generate another list as follows:

[2, 4, 6, 8]. When such functions are applied, the original list is not altered. Functional programming

believes in keeping data immutable and avoids sharing data among multiple processes or threads.

This means the map function that was just illustrated, trivial as it may be, could be run via two or

more multiple threads on the list and these threads would not step on each other, because the list

itself is not altered.

Like the map function, functional programming has a concept of a reduce function. Actually, a

reduce function in functional programming is more commonly known as a fold function. A reduce

or a fold function is also sometimes called an accumulate, compress, or inject function. A reduce or

fold function applies a function on all elements of a data structure, such as a list, and produces a

single result or output. So applying a reduce function-like summation on the list generated out of the

map function, that is, [2, 4, 6, 8], would generate an output equal to 20.

So map and reduce functions could be used in conjunction to process lists of data, where a function

is fi rst applied to each member of a list and then an aggregate function is applied to the transformed

and generated list.

This same simple idea of map and reduce has been extended to work on large data sets. The idea

is slightly modifi ed to work on collections of tuples or key/value pairs. The map function applies

a function on every key/value pair in the collection and generates a new collection. Then the reduce

function works on the new generated collection and applies an aggregate function to compute a fi nal

output. This is better understood through an example, so let me present a trivial one to explain the

fl ow. Say you have a collection of key/value pairs as follows:

[{ “94303”: “Tom”}, {“94303”: “Jane”}, {“94301”: “Arun”}, {“94302”: “Chen”}]

This is a collection of key/value pairs where the key is the zip code and the value is the name of a

person who resides within that zip code. A simple map function on this collection could get the

names of all those who reside in a particular zip code. The output of such a map function is as

follows:

[{“94303”:[“Tom”, “Jane”]}, {“94301”:[“Arun”]}, {“94302”:[“Chen”]}]

Search WWH ::

Custom Search

Home