17.2 Google MapReduce
The MapReduce architecture and programming model, pioneered by Google,
is a system architecture designed for processing and analyzing large data
sets. Google uses it in many applications to process massive amounts of raw
Web data. The MapReduce
system runs on top of the Google File System, within which data are loaded
and partitioned into chunks and each chunk is replicated. Data processing
is colocated with data storage: when a file needs to be processed, the job
scheduler consults a storage metadata service to get the host node for each
chunk and then schedules a map process on that node, so that data locality is
exploited efficiently.
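The locality-aware scheduling step described above can be sketched roughly as follows. This is a minimal illustration, not Google's scheduler: the metadata dictionary, the node and chunk names, and the `schedule_map_tasks` helper are all hypothetical, and the least-loaded tie-breaking rule is an assumption added here to make the example concrete.

```python
from typing import Dict, List

# Hypothetical stand-in for the storage metadata service:
# chunk id -> nodes that hold a replica of that chunk.
ChunkLocations = Dict[str, List[str]]


def schedule_map_tasks(chunk_locations: ChunkLocations) -> Dict[str, str]:
    """Assign each chunk's map task to a node that already stores the chunk."""
    load: Dict[str, int] = {}        # map tasks assigned to each node so far
    assignment: Dict[str, str] = {}  # chunk id -> node chosen for its map task
    for chunk_id in sorted(chunk_locations):
        replicas = chunk_locations[chunk_id]
        # Assumed policy: among the replica holders, pick the node with the
        # fewest tasks so far; ties go to the first node listed.
        node = min(replicas, key=lambda n: load.get(n, 0))
        assignment[chunk_id] = node
        load[node] = load.get(node, 0) + 1
    return assignment


# Illustrative metadata: three chunks, each replicated on three nodes.
metadata = {
    "chunk-0": ["node-a", "node-b", "node-c"],
    "chunk-1": ["node-a", "node-b", "node-d"],
    "chunk-2": ["node-b", "node-c", "node-d"],
}
plan = schedule_map_tasks(metadata)
for chunk_id, node in plan.items():
    print(chunk_id, "->", node)
```

Every map task in the resulting plan runs on a node that already holds a replica of its chunk, so no chunk data has to cross the network before the map phase starts.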
Google engineers designed MapReduce to solve a specific practical
problem. It was therefore designed as a programming model combined with
an implementation of that model, in essence a reference implementation.
The reference implementation was used to demonstrate the practicality and
effectiveness of the concept and to help ensure that the model would be
widely adopted by the computer industry. Over the years, other
implementations of MapReduce have been created and are available as both
open source and commercial products.
The MapReduce architecture allows programmers to use a functional
programming style to create a map function, which processes a key-value pair
associated with the input data to generate a set of intermediate key-value
pairs, and a reduce function, which merges all intermediate values associated
with the same intermediate key.
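As a rough illustration of this model, the sketch below counts words with a map function and a reduce function. It is a single-process toy, not a distributed implementation: the `run_mapreduce` driver, the `map_fn` and `reduce_fn` names, and the sample documents are assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_fn(in_key: str, in_value: str) -> List[Tuple[str, int]]:
    """Emit an intermediate (word, 1) pair for every word in the document."""
    return [(word, 1) for word in in_value.lower().split()]


def reduce_fn(out_key: str, intermediate_values: List[int]) -> List[int]:
    """Merge all counts emitted for one word into a single total."""
    return [sum(intermediate_values)]


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], List[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], List[int]],
) -> Dict[str, List[int]]:
    """Hypothetical in-memory driver: map, group by intermediate key, reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in mapper(in_key, in_value):
            groups[out_key].append(value)  # gather values that share a key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(documents, map_fn, reduce_fn)
print(counts)
```

The grouping step inside the driver stands in for what a real MapReduce system does between the two phases: it collects every intermediate value that shares a key before handing the list to the reduce function.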
Users define a map and a reduce function:
1. The map function processes a (key, value) pair and returns a list of
intermediate (key, value) pairs:
map(in_key, in_value) → list(out_key, intermediate_value).
2. The reduce function merges all intermediate values having the same
intermediate key:
reduce(out_key, list(intermediate_value)) → list(out_value).
The former processes an input key-value pair, producing a set of intermediate
pairs. The latter is in charge of combining all of the intermediate values