the intermediate output (k3, [v4]) from another lineage (β). Based on keys k2 and
k3, the merge function combines the two reduced outputs from different lineages
into a list of key/value outputs [(k4, v5)]. This final output becomes a new lineage
(γ). If α = β then this merge function does a self-merge, which is similar to a self-
join in relational algebra. The main difference between the processing model of
this framework and the original MapReduce is that the reduce function produces a
key/value list instead of just a list of values. This change is introduced
because the merge function requires its input datasets to be organized (partitioned, then
either sorted or hashed) by keys, and these keys have to be passed into the function
to be merged. In the original framework, the reduced output is final. Hence, users
pack whatever is needed in [v3], while passing k2 to the next stage is not required.
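Expressed as type signatures, the three primitives can be sketched as follows; the type-alias names are ours for illustration, not identifiers defined by the framework:

```python
from typing import Callable, Iterable, TypeVar

# Hypothetical type variables matching the keys/values used in the text.
K1 = TypeVar("K1"); V1 = TypeVar("V1")
K2 = TypeVar("K2"); V2 = TypeVar("V2"); V3 = TypeVar("V3")
K3 = TypeVar("K3"); V4 = TypeVar("V4")
K4 = TypeVar("K4"); V5 = TypeVar("V5")

# Lineage alpha: map and reduce as usual, except that reduce keeps its
# key k2 in the output so the merger can match on it.
Mapper = Callable[[K1, V1], Iterable[tuple[K2, V2]]]
Reducer = Callable[[K2, list[V2]], tuple[K2, list[V3]]]

# merge consumes one reduced output from lineage alpha and one from
# lineage beta, and emits a new key/value list, i.e. a new lineage gamma.
Merger = Callable[[tuple[K2, list[V3]], tuple[K3, list[V4]]],
                  Iterable[tuple[K4, V5]]]
```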
Figure 9.5 illustrates a sample execution of the Map-Reduce-Merge framework. In
this example, there are two datasets, Employee and Department, where Employee's
key attribute is emp-id and Department's key is dept-id. The execution of
this example query aims to join these two datasets and compute employee bonuses.
On the left hand side of Fig. 9.5, a mapper reads Employee entries and computes a
bonus for each entry. A reducer then sums up these bonuses for every employee
and sorts them by dept-id, then emp-id. On the right hand side, a mapper
reads Department entries and computes bonus adjustments. A reducer then sorts
these department entries. At the end, a merger matches the output records from the
two reducers on dept-id and applies a department-based bonus adjustment on
employee bonuses. Yang et al. [104] have also proposed an approach for improving
the Map-Reduce-Merge framework by adding a new primitive called Traverse. This
primitive can process index file entries recursively, select data partitions based on
query conditions and feed only selected partitions to other primitives.
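To make the Fig. 9.5 workflow concrete, here is a minimal single-process sketch of the two lineages and the merger; the record layouts and bonus figures are invented for illustration, and the distributed execution machinery is elided:

```python
from collections import defaultdict

# Invented toy data: Employee entries keyed by emp-id, Department
# entries keyed by dept-id (the actual schemas are not given in the text).
employees = [("e1", {"dept": "d1", "bonus": 100.0}),
             ("e1", {"dept": "d1", "bonus": 50.0}),
             ("e2", {"dept": "d2", "bonus": 80.0})]
departments = [("d2", {"adjustment": 0.5}),
               ("d1", {"adjustment": 1.5})]

# Left lineage: the reducer sums per-employee bonuses and keys the
# result by (dept-id, emp-id), then sorts on that compound key.
totals, dept_of = defaultdict(float), {}
for emp_id, rec in employees:
    totals[emp_id] += rec["bonus"]
    dept_of[emp_id] = rec["dept"]
left = sorted(((dept_of[e], e), [t]) for e, t in totals.items())

# Right lineage: the reducer just sorts department entries by dept-id.
right = sorted(departments)

# Merger: match the outputs of the two reducers on dept-id and apply
# the department-based adjustment to each employee's total bonus.
adjust = dict(right)
merged = [((dept, emp), total * adjust[dept]["adjustment"])
          for (dept, emp), (total,) in left]
print(merged)  # [(('d1', 'e1'), 225.0), (('d2', 'e2'), 40.0)]
```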
Map-Join-Reduce [154] represents another approach that has been intro-
duced with a filtering-join-aggregation programming model as an extension of the
standard MapReduce's filtering-aggregation programming model. In particular, in
addition to the standard mapper and reducer operations of the MapReduce
framework, it introduces a third operation, join (called joiner), to the framework.
Hence, to join multiple datasets for aggregation, users specify a set of join()
functions and the join order between them. Then, the runtime system automatically
joins the multiple input datasets according to the join order and invokes the join()
functions to process the joined records. They have also introduced a one-to-many
shuffling strategy which shuffles each intermediate key/value pair to many joiners
at one time. Using a tailored partition strategy, they can utilize the one-to-many
shuffling scheme to join multiple datasets in one phase instead of a sequence
of MapReduce jobs. The runtime system for executing a Map-Join-Reduce job
launches two kinds of processes: MapTask and ReduceTask. Mappers run inside
the MapTask process while joiners and reducers are invoked inside the ReduceTask
process. Therefore, Map-Join-Reduce's process model allows for the pipelining of
intermediate results between joiners and reducers since joiners and reducers are run
inside the same ReduceTask process.
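The one-to-many shuffling scheme can be illustrated with a small routing sketch. The 2-D joiner grid and hash-based replication below are our own assumptions for a three-way chain join, not necessarily the paper's exact partition strategy:

```python
# Hypothetical 2-D grid of joiners for R(a, b) JOIN S(b, c) JOIN T(c, d).
# Joiner (i, j) is responsible for tuples with hash(b) % NB == i and
# hash(c) % NC == j.
NB, NC = 3, 3  # grid dimensions, chosen here arbitrarily

def route(dataset, tup):
    """Return the list of joiners a tuple is shuffled to (one-to-many)."""
    if dataset == "S":                 # carries both join keys: one joiner
        b, c = tup
        return [(hash(b) % NB, hash(c) % NC)]
    if dataset == "R":                 # carries only b: replicate over c
        _, b = tup
        return [(hash(b) % NB, j) for j in range(NC)]
    if dataset == "T":                 # carries only c: replicate over b
        c, _ = tup
        return [(i, hash(c) % NC) for i in range(NB)]
    raise ValueError(dataset)

# An R tuple reaches NC joiners in a single shuffle, which is how the
# three-way join can finish in one phase instead of a sequence of jobs.
print(route("R", ("a1", "b7")))        # e.g. [(1, 0), (1, 1), (1, 2)]
```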