In iMapReduce, the number of map tasks and the number of reduce tasks should be equal. The join operation and the map operation are performed in map tasks, while the reduce operation is performed in reduce tasks. Each reduce task is connected to a corresponding local map task for iterative processing. The paired map task and reduce task are referred to as an MRPair. These map/reduce tasks are persistent tasks that remain alive during the entire iterative process.
An application's structure data is partitioned and distributed to the map tasks before iterative processing begins, and each structure data partition is maintained locally in its map task. Each structure data partition contains a set of structure KVs. The structure KVs are combined with the state KVs for the map operation: each structure KV can be mapped to one or more state KVs according to a mapping rule specified by the join operation, i.e., "one-to-one," "one-to-all," or "one-to-more." Because the structure KVs are distributed, the problem is, given the set of structure KVs (i.e., the structure data partition) located in a map task, how to send their corresponding state KVs to that map task, where they are combined for the map operation.
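The effect of the three mapping rules can be pictured with a short sketch. The following Java snippet is only an illustration, not iMapReduce code; the StructureKV/StateKV classes, the JoinRule enum, and the selectKeys helper are assumed names used to show how one structure KV selects the state KVs it must be combined with before the map operation runs.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class StructureKV { long key; long[] links; }   // e.g., a node and its adjacency list
class StateKV     { long key; double value;  }  // e.g., a node and its current score

enum JoinRule { ONE_TO_ONE, ONE_TO_MORE, ONE_TO_ALL }

class StructureStateJoin {
    // Returns the state KVs that must be co-located with the given structure KV.
    static List<StateKV> join(StructureKV s, Map<Long, StateKV> stateTable, JoinRule rule) {
        List<StateKV> matched = new ArrayList<>();
        switch (rule) {
            case ONE_TO_ONE:               // same key, e.g., PageRank: node -> its rank
                matched.add(stateTable.get(s.key));
                break;
            case ONE_TO_MORE:              // an application-defined subset of state KVs
                for (long k : selectKeys(s)) matched.add(stateTable.get(k));
                break;
            case ONE_TO_ALL:               // every state KV, e.g., k-means: point -> all centroids
                matched.addAll(stateTable.values());
                break;
        }
        return matched;
    }
    // Hypothetical selector for the "one-to-more" rule.
    static long[] selectKeys(StructureKV s) { return new long[] { s.key }; }
}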
Suppose the structure KVs are partitioned according to a hash function F. That is, a structure KV with key SK is assigned to a partition according to p = F(SK, n), where p is the partition id and n is the total number of partitions. Map task p is assigned to process partition p. Since reduce task p is connected to map task p, the output of reduce task p should contain the state KVs corresponding to the structure KVs maintained in map task p. The output of a reduce task is determined by its input, which is generated in the shuffling phase. iMapReduce carefully picks the partition function in the shuffling phase, such that the map output KVs are shuffled to the correct reduce task, where they are converted to the state KVs that correspond to the locally maintained structure KVs in the paired map task. This approach works for the "one-to-one" case and the "one-to-more" case. For the "one-to-all" mapping case, the reduce task output must be broadcast to all map tasks. In summary, each MRPair performs the iterative computation on a data partition, and the necessary information exchange between MRPairs occurs during the mappers-to-reducers shuffling.
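As a rough illustration of how the shuffle can be aligned with the structure-data partitioning, the sketch below reuses the same hash function F inside a Hadoop-style Partitioner. The key/value types and the modulo hash are assumptions for illustration, not the actual iMapReduce partitioner.

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class PairedShufflePartitioner extends Partitioner<LongWritable, DoubleWritable> {

    // F(SK, n): the same hash used to place the structure KV with key SK into partition p.
    static int structurePartition(long structureKey, int numPartitions) {
        return (int) Math.floorMod(structureKey, (long) numPartitions);
    }

    @Override
    public int getPartition(LongWritable key, DoubleWritable value, int numPartitions) {
        // In the "one-to-one"/"one-to-more" cases, the map output key identifies the
        // structure KV that its resulting state KV belongs to, so hashing it with F
        // routes the record to reduce task p, which is paired with the map task
        // holding structure partition p. ("One-to-all" instead broadcasts the
        // reduce output to all map tasks.)
        return structurePartition(key.get(), numPartitions);
    }
}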
3.3.2 Iterative Processing
To support iterative processing in iMapReduce, a few changes to the original Hadoop
framework are made.
3.3.2.1 Persistent Tasks
A map/reduce task is a computing process that applies a specified map/reduce operation to a subset of data records. In Hadoop MapReduce, each map/reduce task is assigned to a slave worker to process a subset of the input/shuffled data, and its life cycle ends when the assigned data records have been processed.
In contrast, each map/reduce task in iMapReduce is persistent. A persistent map/reduce task remains alive during the entire iterative process. When all the assigned data records of a persistent task have been processed, the task becomes dormant, waiting for new input/shuffled data. A map task waits for the results from the reduce tasks and is reactivated when the data from its paired reduce task arrives. A reduce task waits for the output of all map tasks and is reactivated synchronously, as in MapReduce.
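A minimal sketch of this persistent-task life cycle is given below, assuming a queue-based hand-off from the paired reduce task; the class names and wiring are illustrative, not iMapReduce internals.

import java.util.List;
import java.util.concurrent.BlockingQueue;

class PersistentMapTask<R> implements Runnable {
    private final BlockingQueue<List<R>> fromReduce;  // state KVs from the paired reduce task
    private final MapOperation<R> mapOp;

    PersistentMapTask(BlockingQueue<List<R>> fromReduce, MapOperation<R> mapOp) {
        this.fromReduce = fromReduce;
        this.mapOp = mapOp;
    }

    @Override
    public void run() {
        // The task stays alive across iterations; the driver interrupts it
        // once the iterative computation has converged.
        while (!Thread.currentThread().isInterrupted()) {
            try {
                // Dormant: block until the paired reduce task delivers new state KVs.
                List<R> stateKVs = fromReduce.take();
                // Reactivated: join the state KVs with the locally maintained
                // structure partition and apply the map operation.
                mapOp.apply(stateKVs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // exit the persistent loop
            }
        }
    }
}

interface MapOperation<R> { void apply(List<R> stateKVs); }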