1.3.2 Case study: Google's MapReduce—use commodity hardware to create search indexes
One of the most influential case studies in the NoSQL movement is the Google
MapReduce system. In their paper, Google shared the process they used to transform large
volumes of web content into search indexes using low-cost commodity CPUs.
Though the sharing of this information was significant, the concepts of map and reduce
weren't new. Map and reduce functions are simply names for two stages of a data
transformation, as described in figure 1.2.
The initial stage of the transformation is called the map operation. It's
responsible for extracting, transforming, and filtering the data. The results of
the map operation are then sent to a second layer: the reduce function. The reduce
function is where the results are sorted, combined, and summarized to produce the
final result.
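The two stages can be sketched with a classic word count. This is a minimal, single-machine illustration of the idea, not Google's implementation; the function names (`map_stage`, `shuffle_sort`, `reduce_stage`) and the intermediate shuffle/sort step are named here for clarity and aren't part of any MapReduce API.

```python
from collections import defaultdict

def map_stage(documents):
    # Extract and transform: emit one (word, 1) key-value pair per word.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Group the key-value pairs by key and return them sorted by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_stage(grouped):
    # Combine and summarize: total the values emitted for each key.
    return {key: sum(values) for key, values in grouped}

docs = ["the map emits pairs", "the reduce sums pairs"]
counts = reduce_stage(shuffle_sort(map_stage(docs)))
# counts == {'emits': 1, 'map': 1, 'pairs': 2, 'reduce': 1, 'sums': 1, 'the': 2}
```

Because each map call touches only its own input record and each reduce call touches only one key's values, the two stages can be spread across many machines independently—the property the rest of this section builds on.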
The core concepts behind the map and reduce functions are based on solid computer
science work that dates back to the 1950s, when programmers at MIT implemented
these functions in the influential LISP system. LISP was different from other
programming languages because it emphasized functions that transformed isolated
lists of data. This focus is now the basis for many modern functional programming
languages that have desirable properties on distributed systems.
Google extended the map and reduce functions to reliably execute on billions of
web pages on hundreds or thousands of low-cost commodity CPUs. Google made map
and reduce work reliably on large volumes of data and did it at a low cost. It was
Google's use of MapReduce that encouraged others to take another look at the power
of functional programming and the ability of functional programming systems to
scale over thousands of low-cost CPUs. Software packages such as Hadoop have closely
modeled these functions.
[Figure: input data flows through parallel map tasks, then a shuffle/sort layer, then a reduce layer, to the final result. The map layer extracts the data from the input and transforms the results into key-value pairs, which are then sent to the shuffle/sort layer. The shuffle/sort layer returns the key-value pairs sorted by the keys. The reduce layer collects the sorted results and performs counts and totals before it returns the final result.]

Figure 1.2 The map and reduce functions are ways of partitioning large datasets into smaller chunks that can be transformed on isolated and independent transformation systems. The key is isolating each function so that it can be scaled onto many servers.