the intermediate output (k3, [v4]) from another lineage (β). Based on keys k2 and
k3, the merge function combines the two reduced outputs from different lineages
into a list of key/value outputs [(k4, v5)]. This final output becomes a new lineage
(γ). If α = β then this merge function does a self-merge, which is similar to a self-
join in relational algebra. The main difference between the processing model of
this framework and the original MapReduce is that the reduce function produces a
key/value list instead of just a list of values. This change is introduced
because the merge function requires its input datasets to be organized (partitioned, then
either sorted or hashed) by keys, and these keys have to be passed into the function
to be merged. In the original framework, the reduced output is final. Hence, users
pack whatever is needed in [v3], while passing k2 to the next stage is not required.
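Expressed as type signatures, the three primitives can be sketched as follows; the type-alias names are ours for illustration, not identifiers defined by the framework:

```python
from typing import Callable, Iterable, TypeVar

# Hypothetical type variables matching the keys/values used in the text.
K1 = TypeVar("K1"); V1 = TypeVar("V1")
K2 = TypeVar("K2"); V2 = TypeVar("V2"); V3 = TypeVar("V3")
K3 = TypeVar("K3"); V4 = TypeVar("V4")
K4 = TypeVar("K4"); V5 = TypeVar("V5")

# Lineage alpha: map and reduce as usual, except that reduce keeps its
# key k2 in the output so the merger can match on it.
Mapper = Callable[[K1, V1], Iterable[tuple[K2, V2]]]
Reducer = Callable[[K2, list[V2]], tuple[K2, list[V3]]]

# merge consumes one reduced output from lineage alpha and one from
# lineage beta, and emits a new key/value list, i.e. a new lineage gamma.
Merger = Callable[[tuple[K2, list[V3]], tuple[K3, list[V4]]],
                  Iterable[tuple[K4, V5]]]
```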
Figure 9.5 illustrates a sample execution of the Map-Reduce-Merge framework. In
this example, there are two datasets, Employee and Department, where Employee's
key attribute is emp-id and Department's key is dept-id. The execution of
this example query aims to join these two datasets and compute employee bonuses.
On the left hand side of Fig. 9.5, a mapper reads Employee entries and computes a
bonus for each entry. A reducer then sums up these bonuses for every employee
and sorts them by dept-id, then emp-id. On the right hand side, a mapper
reads Department entries and computes bonus adjustments. A reducer then sorts
these department entries. At the end, a merger matches the output records from the
two reducers on dept-id and applies a department-based bonus adjustment on
employee bonuses. Yang et al. [104] have also proposed an approach for improving
the Map-Reduce-Merge framework by adding a new primitive called Traverse. This
primitive can process index file entries recursively, select data partitions based on
query conditions and feed only selected partitions to other primitives.
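To make the Fig. 9.5 workflow concrete, here is a minimal single-process sketch of the two lineages and the merger; the record layouts and bonus figures are invented for illustration, and the distributed execution machinery is elided:

```python
from collections import defaultdict

# Invented toy data: Employee entries keyed by emp-id, Department
# entries keyed by dept-id (the actual schemas are not given in the text).
employees = [("e1", {"dept": "d1", "bonus": 100.0}),
             ("e1", {"dept": "d1", "bonus": 50.0}),
             ("e2", {"dept": "d2", "bonus": 80.0})]
departments = [("d2", {"adjustment": 0.5}),
               ("d1", {"adjustment": 1.5})]

# Left lineage: the reducer sums per-employee bonuses and keys the
# result by (dept-id, emp-id), then sorts on that compound key.
totals, dept_of = defaultdict(float), {}
for emp_id, rec in employees:
    totals[emp_id] += rec["bonus"]
    dept_of[emp_id] = rec["dept"]
left = sorted(((dept_of[e], e), [t]) for e, t in totals.items())

# Right lineage: the reducer just sorts department entries by dept-id.
right = sorted(departments)

# Merger: match the outputs of the two reducers on dept-id and apply
# the department-based adjustment to each employee's total bonus.
adjust = dict(right)
merged = [((dept, emp), total * adjust[dept]["adjustment"])
          for (dept, emp), (total,) in left]
print(merged)  # [(('d1', 'e1'), 225.0), (('d2', 'e2'), 40.0)]
```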
Map-Join-Reduce [154] represents another approach that has been intro-
duced with a filtering-join-aggregation programming model as an extension of the
standard MapReduce's filtering-aggregation programming model. In particular, in
addition to the standard mapper and reducer operations of the MapReduce
framework, it introduces a third operation, join (called joiner), to the framework.
Hence, to join multiple datasets for aggregation, users specify a set of join()
functions and the join order between them. Then, the runtime system automatically
joins the multiple input datasets according to the join order and invokes the join()
functions to process the joined records. They have also introduced a one-to-many
shuffling strategy which shuffles each intermediate key/value pair to many joiners
at one time. Using a tailored partition strategy, they can utilize the one-to-many
shuffling scheme to join multiple datasets in one phase instead of a sequence
of MapReduce jobs. The runtime system for executing a Map-Join-Reduce job
launches two kinds of processes: MapTask and ReduceTask. Mappers run inside
the MapTask process while joiners and reducers are invoked inside the ReduceTask
process. Therefore, Map-Join-Reduce's process model allows for the pipelining of
intermediate results between joiners and reducers since joiners and reducers are run
inside the same ReduceTask process.
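The one-to-many shuffling scheme can be illustrated with a small routing sketch. The 2-D joiner grid and hash-based replication below are our own assumptions for a three-way chain join, not necessarily the paper's exact partition strategy:

```python
# Hypothetical 2-D grid of joiners for R(a, b) JOIN S(b, c) JOIN T(c, d).
# Joiner (i, j) is responsible for tuples with hash(b) % NB == i and
# hash(c) % NC == j.
NB, NC = 3, 3  # grid dimensions, chosen here arbitrarily

def route(dataset, tup):
    """Return the list of joiners a tuple is shuffled to (one-to-many)."""
    if dataset == "S":                 # carries both join keys: one joiner
        b, c = tup
        return [(hash(b) % NB, hash(c) % NC)]
    if dataset == "R":                 # carries only b: replicate over c
        _, b = tup
        return [(hash(b) % NB, j) for j in range(NC)]
    if dataset == "T":                 # carries only c: replicate over b
        c, _ = tup
        return [(i, hash(c) % NC) for i in range(NB)]
    raise ValueError(dataset)

# An R tuple reaches NC joiners in a single shuffle, which is how the
# three-way join can finish in one phase instead of a sequence of jobs.
print(route("R", ("a1", "b7")))        # e.g. [(1, 0), (1, 1), (1, 2)]
```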