Database Reference
In-Depth Information
(
w
1
,
1)
,
(
w
2
,
1)
, . . . ,
(
w
n
,
1)
Note that a single Map task will typically process many documents - all the documents
in one or more chunks. Thus, its output will be more than the sequence for the one docu-
ment suggested above. Note also that if a word
w
appears
m
times among all the documents
assigned to that process, then there will be
m
key-value pairs (
w,
1) among its output. An
option, which we discuss in
Section 2.2.4
, is to combine these
m
pairs into a single pair (
w,
m
), but we can only do that because, as we shall see, the Reduce tasks apply an associative
and commutative operation, addition, to the values.
□
2.2.2
Grouping by Key
As soon as the Map tasks have all completed successfully, the key-value pairs are grouped
by key, and the values associated with each key are formed into a list of values. The group-
ing performed by the system, regardless of what the Map and Reduce tasks do. The master
controller process knows how many Reduce tasks there will be, say
r
such tasks. The user
typically tells the MapReduce system what
r
should be. Then the master controller picks a
hash function that applies to keys and produces a bucket number from 0 to
r
− 1. Each key
that is output by a Map task is hashed and its key-value pair is put in one of
r
local files.
Each file is destined for one of the Reduce tasks.
1
To perform the grouping by key and distribution to the Reduce tasks, the master control-
ler merges the files from each Map task that are destined for a particular Reduce task and
feeds the merged file to that process as a sequence of key-list-of-value pairs. That is, for
each key
k
, the input to the Reduce task that handles key
k
is a pair of the form (
k,
[
v
1
, v
2
,
. . . , v
n
]), where (
k, v
1
)
,
(
k, v
2
)
, . . . ,
(
k, v
n
) are all the key-value pairs with key
k
coming
from all the Map tasks.
2.2.3
The Reduce Tasks
The Reduce function's argument is a pair consisting of a key and its list of associated val-
ues. The output of the Reduce function is a sequence of zero or more key-value pairs. These
key-value pairs can be of a type different from those sent from Map tasks to Reduce tasks,
but often they are the same type. We shall refer to the application of the Reduce function to
a single key and its associated list of values as a
reducer
.
A Reduce task receives one or more keys and their associated value lists. That is, a Re-
duce task executes one or more reducers. The outputs from all the Reduce tasks are merged
into a single file. Reducers may be partitioned among a smaller number of Reduce tasks is