Database Reference
In-Depth Information
( w 1 , 1) , ( w 2 , 1) , . . . , ( w n , 1)
Note that a single Map task will typically process many documents - all the documents
in one or more chunks. Thus, its output will be more than the sequence for the one docu-
ment suggested above. Note also that if a word w appears m times among all the documents
assigned to that process, then there will be m key-value pairs ( w, 1) among its output. An
option, which we discuss in Section 2.2.4 , is to combine these m pairs into a single pair ( w,
m ), but we can only do that because, as we shall see, the Reduce tasks apply an associative
and commutative operation, addition, to the values.
2.2.2
Grouping by Key
As soon as the Map tasks have all completed successfully, the key-value pairs are grouped
by key, and the values associated with each key are formed into a list of values. The group-
ing performed by the system, regardless of what the Map and Reduce tasks do. The master
controller process knows how many Reduce tasks there will be, say r such tasks. The user
typically tells the MapReduce system what r should be. Then the master controller picks a
hash function that applies to keys and produces a bucket number from 0 to r − 1. Each key
that is output by a Map task is hashed and its key-value pair is put in one of r local files.
Each file is destined for one of the Reduce tasks. 1
To perform the grouping by key and distribution to the Reduce tasks, the master control-
ler merges the files from each Map task that are destined for a particular Reduce task and
feeds the merged file to that process as a sequence of key-list-of-value pairs. That is, for
each key k , the input to the Reduce task that handles key k is a pair of the form ( k, [ v 1 , v 2 ,
. . . , v n ]), where ( k, v 1 ) , ( k, v 2 ) , . . . , ( k, v n ) are all the key-value pairs with key k coming
from all the Map tasks.
2.2.3
The Reduce Tasks
The Reduce function's argument is a pair consisting of a key and its list of associated val-
ues. The output of the Reduce function is a sequence of zero or more key-value pairs. These
key-value pairs can be of a type different from those sent from Map tasks to Reduce tasks,
but often they are the same type. We shall refer to the application of the Reduce function to
a single key and its associated list of values as a reducer .
A Reduce task receives one or more keys and their associated value lists. That is, a Re-
duce task executes one or more reducers. The outputs from all the Reduce tasks are merged
into a single file. Reducers may be partitioned among a smaller number of Reduce tasks is
Search WWH ::




Custom Search