Big Data Processing Systems - Cloud Data Management

Database Reference

In-Depth Information

emp-id

emp-info: dept-id

emp-info: bonus

dept-id

dept-info: bonus adjustment

1

B

innovation award ($100)

1.1

0.9

B

A

1

2

3

B

hard worker award ($50)

A

NULL ($0)

RHS mapper retrieves bonus adjustment

A

high-performer ($150)

A

Innovation award ($100)

dept-id

bonus adjustment

1.15

LHS mapper computes emp bonuses

B

0.95

A

emp-id

dept-id

bonus

1

B

$100

RHS reducer modified bonus adjustment and

sorts on dept-id

$50

$0

1

2

B

A

dept-id

bonus adjustment

3

$150

$100

A

0.95

1.15

B

LHS reducer sorts on (dept-id, emp-id)

pair and sums up emp bonuses

match keys on dept-id

emp-id

dept-id

bonus-sum

2

A

B

$0

3

1

$250

$150

emp-id

bonus

A sort-merge merger joins LHS and

RHS reduced outputs, then

computes final emp bonuses.

$0

$237.5

$172.5

2

3

1

Fig. 9.5

A sample execution of the Map-Reduce-Merge framework

Afrati and Ullman [ 60 , 61 ] have presented another approach to improve the

join phase in the MapReduce framework. The approach aims to optimize the

communication cost by focusing on selecting the most appropriate attributes that

are used to partition and replicate the data among the reduce process. Therefore,

it begins by identifying the map-key , the set of attributes that identify the Reduce

process to which a Map process must send a particular tuple. Each attribute of the

map-key gets a “ share ” which is the number of buckets into which its values are

hashed, to form a component of the identifier of a Reduce process. Relations have

their tuples replicated in limited fashion of which the degree of replication depends

on the shares for those map-key attributes that are missing from their schema.

The approach considers two important special join cases: chain joins (represents

a sequence of two-way join operations where the output of one operation in this

sequence is used as an input to another operation in a pipelined fashion) and star

joins (represents joining of a large fact table with several smaller dimension tables).

In each case, the proposed algorithm is able to determine the map-key and determine

the shares that yield the least replication. The proposed approach is not always

Cloud Data Management

Search WWH ::

Custom Search

Home