[Figure 2.5 panels: the LHS mapper computes employee bonuses; the LHS reducer sorts on (dept-id, emp-id) pairs and sums up the employee bonuses; the RHS mapper retrieves the per-department bonus adjustments; the RHS reducer modifies the bonus adjustments and sorts on dept-id; a sort-merge merger matches keys on dept-id, joins the LHS and RHS reduced outputs, and computes the final employee bonuses.]

FIGURE 2.5 A sample execution of the map-reduce-merge framework. (From H. C. Yang et al., Map-reduce-merge: Simplified relational data processing on large clusters, in SIGMOD, pp. 1029-1040, 2007.)
recursively, select data partitions based on query conditions, and feed only selected
partitions to other primitives.
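To make the dataflow of Figure 2.5 concrete, the following Python snippet is a minimal, single-process sketch of the same computation. The table values are taken from the figure; the function names, the +0.05 adjustment step, and the in-memory data structures are illustrative assumptions, not part of the actual map-reduce-merge API or runtime.

```python
from collections import defaultdict

# Input data mirroring Figure 2.5: (emp-id, dept-id, bonus) records
# and (dept-id, bonus adjustment) records.
emp_bonuses = [(1, "B", 100), (1, "B", 50), (2, "A", 0), (3, "A", 150), (3, "A", 100)]
dept_adjustments = [("A", 0.9), ("B", 1.1)]

# LHS mapper: key each employee bonus on the (dept-id, emp-id) pair.
def lhs_map(records):
    for emp_id, dept_id, bonus in records:
        yield (dept_id, emp_id), bonus

# LHS reducer: sum bonuses per employee and sort on (dept-id, emp-id).
def lhs_reduce(pairs):
    sums = defaultdict(int)
    for key, bonus in pairs:
        sums[key] += bonus
    return sorted(sums.items())

# RHS reducer: modify the bonus adjustments (in the figure, 1.1 -> 1.15 and
# 0.9 -> 0.95, i.e., +0.05 -- an assumption here) and sort on dept-id.
def rhs_reduce(records):
    return sorted((dept_id, adj + 0.05) for dept_id, adj in records)

# Merger: sort-merge join LHS and RHS reduced outputs on dept-id,
# then compute the final employee bonuses.
def merge(lhs_out, rhs_out):
    adjustments = dict(rhs_out)
    for (dept_id, emp_id), bonus_sum in lhs_out:
        yield emp_id, round(bonus_sum * adjustments[dept_id], 2)

lhs_out = lhs_reduce(lhs_map(emp_bonuses))
rhs_out = rhs_reduce(dept_adjustments)
print(list(merge(lhs_out, rhs_out)))
# [(2, 0.0), (3, 237.5), (1, 172.5)]
```

Because both reduced outputs arrive sorted on the join key, the merger can stream through them without materializing either side in full; the dictionary lookup above simply stands in for that sort-merge step.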
Map-join-reduce [76] represents another approach, which introduces a filtering-join-aggregation programming model as an extension of the standard MapReduce filtering-aggregation model. In particular, in addition to the standard map and reduce operations of the MapReduce framework, it adds a third operation, join (called the joiner). To join multiple data sets for aggregation, users specify a set of join() functions and the join order between them. The runtime system then automatically joins the multiple input data sets according to the join order and invokes the join() functions to process the joined records. The approach also introduces a one-to-many shuffling strategy that shuffles each intermediate key/value pair to many joiners at one time. Combined with a tailored partitioning strategy, this one-to-many shuffling scheme can join multiple data sets in one phase instead of a sequence of MapReduce jobs. The runtime system for executing a map-join-reduce job launches two kinds of processes: MapTask and ReduceTask. Mappers run inside the MapTask process, whereas joiners and reducers are invoked inside the ReduceTask process. Therefore, the map-join-reduce process model allows the pipelining of intermediate results between joiners and reducers, since they run inside the same ReduceTask process.
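The filtering-join-aggregation idea can be sketched in a few lines of Python. The data sets (orders and customers), the function names, and the single-process control flow below are hypothetical illustrations; in the actual map-join-reduce runtime the mappers run in MapTask processes, while the joiner and reducer steps run together inside ReduceTask processes and are pipelined.

```python
from collections import defaultdict

# Two input data sets to be joined before aggregation (illustrative data).
orders = [("o1", "c1", 30.0), ("o2", "c1", 20.0), ("o3", "c2", 50.0)]
customers = [("c1", "US"), ("c2", "DE")]

# Filtering (map): each mapper tags its records with the data-set name and
# emits them keyed on the join attribute (customer id).
def map_orders(records):
    for order_id, cust_id, amount in records:
        yield cust_id, ("orders", amount)

def map_customers(records):
    for cust_id, country in records:
        yield cust_id, ("customers", country)

# Join (joiner): records shuffled to the same joiner are combined on the join key.
def joiner(tagged_records):
    by_key = defaultdict(lambda: {"orders": [], "customers": []})
    for key, (source, value) in tagged_records:
        by_key[key][source].append(value)
    for key, groups in by_key.items():
        for country in groups["customers"]:
            for amount in groups["orders"]:
                yield country, amount

# Aggregation (reduce): sum order amounts per country. In map-join-reduce,
# joined records flow to the reducer within the same ReduceTask process.
def reducer(joined):
    totals = defaultdict(float)
    for country, amount in joined:
        totals[country] += amount
    return dict(totals)

tagged = list(map_orders(orders)) + list(map_customers(customers))
print(reducer(joiner(tagged)))
# {'US': 50.0, 'DE': 50.0}
```

The one-to-many shuffling strategy corresponds, in this sketch, to routing each tagged key/value pair to every joiner whose join key it matches, so that a multiway join completes in one phase rather than as a chain of MapReduce jobs.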