[Figure omitted: bar charts comparing NAT, RLS, and MRP; (a) percentage of transferred data for queries Q5 (HJ), Q7 (REPJ), Q9 (HJ), and Q17 (COG); (b) percentage of transferred data for cluster sizes 5 to 25.]
Fig. 2. Percentage of transferred data for (a) different types of queries, (b) varying cluster and data size
depending on the cluster size, so that each node is assigned 2GB of data. Fig. 2(b) shows the percentage of transferred data for the three approaches as the number of cluster nodes increases. As shown, with an increasing number of nodes our approach maintains a steady data locality, whereas it decreases for the other approaches. Since there is no skew in the key frequencies, both native Hadoop and RLS obtain a data locality close to 1 divided by the number of nodes. Our experiments with different data sizes for the same cluster size show no change in the percentage of transferred data for MR-Part (the results are not shown in the paper due to space restrictions).
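The near-1/n locality of the baseline approaches follows from hash partitioning alone, and can be checked with a small simulation (a sketch under simplified assumptions, not the authors' code): if each reducer's input is produced uniformly across the cluster nodes and the reducer runs on a single node, only about 1/n of that input is local, so roughly 1 - 1/n of the intermediate data must be shuffled.

```python
import random

def expected_transfer(num_nodes, num_keys=100_000, seed=0):
    """Estimate the fraction of intermediate data shuffled over the
    network when keys are hash-partitioned and reduce tasks run on
    arbitrary nodes (the no-co-location baseline)."""
    rng = random.Random(seed)
    # Each key's map output lives on the node that produced it ...
    producer = [rng.randrange(num_nodes) for _ in range(num_keys)]
    # ... and its reducer is assigned to a node by hashing the key.
    consumer = [hash(k) % num_nodes for k in range(num_keys)]
    remote = sum(p != c for p, c in zip(producer, consumer))
    return remote / num_keys

for n in (5, 10, 20):
    print(n, round(expected_transfer(n), 3))  # approaches 1 - 1/n
```

The estimate converges to (n - 1)/n as the number of keys grows, matching the observation that the baselines' data locality drops toward 1/n with cluster size while a co-partitioning scheme can keep it steady.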
Response Time. As shown in the previous subsection, MR-Part can significantly reduce the amount of data transferred in the shuffle phase. However, its impact on response time strongly depends on the network bandwidth. In this section, we measure the effect of MR-Part on MapReduce response time by varying the network bandwidth. We control the point-to-point bandwidth using the Linux tc command-line utility. We execute query Q5 on a cluster of 20 nodes with a scale factor of 40 (40GB total dataset size).
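One common way to impose such a cap with tc is a token-bucket filter on each node's network interface. The interface name (eth0) and the exact rate, burst, and latency values below are illustrative assumptions, not the settings used in the paper:

```shell
# Cap outgoing bandwidth on eth0 to 10 Mbit/s with a token-bucket filter.
# Run as root on every cluster node; adjust the device name to your setup.
tc qdisc add dev eth0 root tbf rate 10mbit burst 32kbit latency 400ms

# Inspect the active queueing discipline.
tc qdisc show dev eth0

# Remove the cap once the experiment is done.
tc qdisc del dev eth0 root
```

Repeating the experiment at several rate values then yields the bandwidth sweep reported below.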
The results are shown in Fig. 3. As we can see in Fig. 3(a), the slower the network, the greater the impact of data locality on execution time. To show where the improvement is produced, in Fig. 3(b) we report the time spent in data shuffling. Measuring shuffle time is not straightforward, since in native Hadoop shuffling starts once 5% of the map tasks have finished and proceeds in parallel as the rest complete. Because of that, we represent two lines: NAT-ms, the time from when the first shuffle byte is sent until the phase is completed, and NAT-os, the period during which the system is dedicated only to shuffling (after the last map task finishes). For MR-Part only the second line has to be represented, as the system must wait for all map tasks to complete before scheduling reduce tasks. We can observe that, while shuffle time is almost constant for MR-Part regardless of the network conditions, it increases significantly as the network bandwidth decreases for the other alternatives. As a consequence, the response time of MR-Part is less sensitive to the network bandwidth than that of native Hadoop. For instance, at 10 Mbps, MR-Part executes in around 30% less time than native Hadoop.