1.3.2 Case study: Google's MapReduce—use commodity hardware to create search indexes
One of the most influential case studies in the NoSQL movement is the Google
MapReduce system. In their paper, Google shared the process they used to transform large
volumes of web content into search indexes using low-cost commodity CPUs.
Though the sharing of this information was significant, the concepts of map and reduce
weren't new. Map and reduce functions are simply names for two stages of a data
transformation, as described in figure 1.2.
The initial stage of the transformation is called the map operation. It's
responsible for extracting, transforming, and filtering the data. The results of
the map operation are then sent to a second layer: the reduce function. The reduce
function is where the results are sorted, combined, and summarized to produce the
final result.
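The two stages can be sketched with a classic word count. This is a minimal, single-machine illustration of the idea, not Google's implementation; the function names (`map_stage`, `shuffle_sort`, `reduce_stage`) and the intermediate shuffle/sort step are named here for clarity and aren't part of any MapReduce API.

```python
from collections import defaultdict

def map_stage(documents):
    # Extract and transform: emit one (word, 1) key-value pair per word.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Group the key-value pairs by key and return them sorted by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_stage(grouped):
    # Combine and summarize: total the values emitted for each key.
    return {key: sum(values) for key, values in grouped}

docs = ["the map emits pairs", "the reduce sums pairs"]
counts = reduce_stage(shuffle_sort(map_stage(docs)))
# counts == {'emits': 1, 'map': 1, 'pairs': 2, 'reduce': 1, 'sums': 1, 'the': 2}
```

Because each map call touches only its own input record and each reduce call touches only one key's values, the two stages can be spread across many machines independently—the property the rest of this section builds on.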
The core concepts behind the map and reduce functions are based on solid computer
science work that dates back to the 1950s, when programmers at MIT implemented
these functions in the influential LISP system. LISP was different from other
programming languages because it emphasized functions that transformed isolated
lists of data. This focus is now the basis for many modern functional programming
languages that have desirable properties on distributed systems.
Google extended the map and reduce functions to reliably execute on billions of
web pages on hundreds or thousands of low-cost commodity CPUs. Google made map
and reduce work reliably on large volumes of data and did it at a low cost. It was
Google's use of MapReduce that encouraged others to take another look at the power
of functional programming and the ability of functional programming systems to
scale over thousands of low-cost CPUs. Software packages such as Hadoop have closely
modeled these functions.
[Figure: input data flows through parallel map tasks, then a shuffle/sort layer, then a reduce layer, to the final result. The map layer extracts the data from the input and transforms the results into key-value pairs, which are then sent to the shuffle/sort layer. The shuffle/sort layer returns the key-value pairs sorted by the keys. The reduce layer collects the sorted results and performs counts and totals before it returns the final result.]

Figure 1.2 The map and reduce functions are ways of partitioning large datasets into smaller chunks that can be transformed on isolated and independent transformation systems. The key is isolating each function so that it can be scaled onto many servers.