MapReduce and the New Software Stack - Mining of Massive Datasets

Database Reference

In-Depth Information

( w 1 , 1) , ( w 2 , 1) , . . . , ( w n , 1)

Note that a single Map task will typically process many documents - all the documents

in one or more chunks. Thus, its output will be more than the sequence for the one docu-

ment suggested above. Note also that if a word w appears m times among all the documents

assigned to that process, then there will be m key-value pairs ( w, 1) among its output. An

option, which we discuss in Section 2.2.4 , is to combine these m pairs into a single pair ( w,

m ), but we can only do that because, as we shall see, the Reduce tasks apply an associative

and commutative operation, addition, to the values.

□

2.2.2

Grouping by Key

As soon as the Map tasks have all completed successfully, the key-value pairs are grouped

by key, and the values associated with each key are formed into a list of values. The group-

ing performed by the system, regardless of what the Map and Reduce tasks do. The master

controller process knows how many Reduce tasks there will be, say r such tasks. The user

typically tells the MapReduce system what r should be. Then the master controller picks a

hash function that applies to keys and produces a bucket number from 0 to r − 1. Each key

that is output by a Map task is hashed and its key-value pair is put in one of r local files.

Each file is destined for one of the Reduce tasks. 1

To perform the grouping by key and distribution to the Reduce tasks, the master control-

ler merges the files from each Map task that are destined for a particular Reduce task and

feeds the merged file to that process as a sequence of key-list-of-value pairs. That is, for

each key k , the input to the Reduce task that handles key k is a pair of the form ( k, [ v 1 , v 2 ,

. . . , v n ]), where ( k, v 1 ) , ( k, v 2 ) , . . . , ( k, v n ) are all the key-value pairs with key k coming

from all the Map tasks.

2.2.3

The Reduce Tasks

The Reduce function's argument is a pair consisting of a key and its list of associated val-

ues. The output of the Reduce function is a sequence of zero or more key-value pairs. These

key-value pairs can be of a type different from those sent from Map tasks to Reduce tasks,

but often they are the same type. We shall refer to the application of the Reduce function to

a single key and its associated list of values as a reducer .

A Reduce task receives one or more keys and their associated value lists. That is, a Re-

duce task executes one or more reducers. The outputs from all the Reduce tasks are merged

into a single file. Reducers may be partitioned among a smaller number of Reduce tasks is

Search WWH ::

Custom Search

Home