Algorithm 1. Metadata combination
Data: F: input file; W: set of jobs composing the workload
Result: H = (H_V, H_E): hypergraph modeling the workload
begin
    H_E ← ∅; H_V ← ∅
    foreach job ∈ W do
        T ← ∅; K ← ∅
        foreach m_i ∈ M_job do
            md_i ← getMetadata(m_i)
            if F = getFile(md_i) then
                foreach ⟨k, {rc_1, ..., rc_n}⟩ ∈ md_i do
                    {t_1.id, ..., t_n.id} ← generateTupleID(c_i, {rc_1, ..., rc_n})
                    T[k] ← T[k] ∪ {t_1.id, ..., t_n.id}; K ← K ∪ {k}
        foreach intermediate key k ∈ K do
            H_V ← H_V ∪ T[k]; H_E ← H_E ∪ {T[k]}
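To make the data structures concrete, the following Java sketch mirrors the core of Algorithm 1: tuple identifiers are grouped by intermediate key, so that each key's tuple set becomes a hyperedge of H_E and the union of these sets forms the vertex set H_V. The class, the metadata layout passed to addMetadataEntry, and the tuple-ID encoding are illustrative assumptions rather than the system's actual interfaces.

import java.util.*;

/** Minimal sketch of Algorithm 1: group tuple IDs by intermediate key.
 *  The metadata layout (key, chunk id, record offsets) and the ID encoding
 *  are assumptions made for illustration. */
class HypergraphBuilder {
    // One hyperedge per intermediate key: the set of tuple IDs producing it (T[k]).
    private final Map<String, Set<Long>> edgesByKey = new HashMap<>();
    private final Set<Long> vertices = new HashSet<>();              // H_V

    /** Invoked for every metadata entry <k, {rc_1, ..., rc_n}> of a map task m_i
     *  whose metadata refers to the input file F. */
    void addMetadataEntry(String key, long chunkId, List<Long> recordOffsets) {
        Set<Long> edge = edgesByKey.computeIfAbsent(key, x -> new HashSet<>());
        for (long rc : recordOffsets) {
            long tupleId = generateTupleId(chunkId, rc);              // generateTupleID(c_i, rc_j)
            edge.add(tupleId);                                        // T[k] <- T[k] U {t_j.id}
            vertices.add(tupleId);                                    // H_V <- H_V U T[k]
        }
    }

    /** Hypothetical tuple-ID scheme: chunk id combined with the record's
     *  relative position inside that chunk. */
    private long generateTupleId(long chunkId, long relativeOffset) {
        return (chunkId << 32) | relativeOffset;
    }

    Collection<Set<Long>> hyperedges() { return edgesByKey.values(); } // H_E
    Set<Long> vertexSet()              { return vertices; }            // H_V
}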
3.2 Repartitioning
Once we have modeled the workload of each input file through a hypergraph,
we apply a min-cut k-way graph partitioning algorithm. The algorithm takes as
input a value k and a hypergraph, and produces k disjoint subsets of vertices
minimizing the sum of the weights of the edges between vertices of different
subsets. Weights can be associated with vertices, for instance to represent different
sizes. We set k as the number of chunks in the input file. By using the min-cut
algorithm, the tuples that are used for generating the same intermediate key are
usually assigned to the same partition.
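To hand the hypergraph to an off-the-shelf min-cut k-way partitioner, it can be serialized in a standard input format. The sketch below writes the plain hMETIS .hgr format (first line: number of hyperedges and number of vertices; one subsequent line per hyperedge listing its vertices as 1-based indices), reusing the hyperedge sets built above. The text does not name a specific partitioning tool, so this format choice is only an assumption.

import java.io.*;
import java.util.*;

/** Sketch: dump the workload hypergraph in the plain hMETIS .hgr format so a
 *  min-cut k-way hypergraph partitioner can be run on it with k set to the
 *  number of chunks of the input file. */
class HypergraphWriter {
    static void writeHgr(Collection<Set<Long>> hyperedges, File out) throws IOException {
        // Map tuple IDs to consecutive 1-based vertex indices, as .hgr expects.
        Map<Long, Integer> index = new LinkedHashMap<>();
        for (Set<Long> edge : hyperedges)
            for (long v : edge) index.putIfAbsent(v, index.size() + 1);

        try (PrintWriter w = new PrintWriter(new FileWriter(out))) {
            w.println(hyperedges.size() + " " + index.size());   // |H_E| |H_V|
            for (Set<Long> edge : hyperedges) {                  // one line per hyperedge
                StringBuilder line = new StringBuilder();
                for (long v : edge) line.append(index.get(v)).append(' ');
                w.println(line.toString().trim());
            }
        }
    }
}

A partitioner run on such a file typically returns one part number per vertex, which is the tuple-to-partition assignment used in the repartitioning step described next.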
The output of the algorithm indicates the set of tuples that has to be assigned to each chunk of the input file. The input file should then be repartitioned according to the produced assignment. However, the repartitioning cannot be done in a straightforward manner, mainly because HDFS creates chunks automatically as new data is appended to a file. We create a set of temporary files, one for each partition; we then read the original file and, for each tuple, the output of the graph algorithm indicates to which temporary file the tuple should be copied. Two strategies are then possible: 1) create a set of files in one directory, one per partition, as is done in the reduce phase of MapReduce executions, or 2) write the generated files sequentially into a single file. In both cases, at the end of the process, we remove the old file and rename the new file or directory to the original name. The first strategy is straightforward, and instead of writing the data to temporary files, it can be written directly to HDFS, as the sketch below illustrates.
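The following Java sketch illustrates the first strategy under stated assumptions: tuples are newline-delimited, the partitioner's output is available as a map from tuple position to assigned partition, and the standard HDFS FileSystem API is used to write one file per partition before swapping the result in under the original name. The class name, paths, and defaulting behavior are illustrative, not prescribed by the text above.

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Map;

/** Sketch of the repartitioning pass (strategy 1): one file per partition in a
 *  directory, followed by delete + rename so the new data keeps the old name.
 *  Tuple format and partition-assignment representation are assumptions. */
public class Repartitioner {
    public static void repartition(FileSystem fs, Path input, int k,
                                   Map<Long, Integer> partitionOf) throws IOException {
        Path outDir = new Path(input + ".repart");
        fs.mkdirs(outDir);

        // One output file per partition, written directly in HDFS.
        FSDataOutputStream[] out = new FSDataOutputStream[k];
        for (int p = 0; p < k; p++)
            out[p] = fs.create(new Path(outDir, String.format("part-%05d", p)));

        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(input)))) {
            String tuple;
            long tupleId = 0;
            while ((tuple = in.readLine()) != null) {
                // The partitioner's output says where this tuple belongs;
                // tuples without an assignment default to partition 0 (an assumption).
                int p = partitionOf.getOrDefault(tupleId++, 0);
                out[p].writeBytes(tuple + "\n");
            }
        }
        for (FSDataOutputStream o : out) o.close();

        fs.delete(input, false);    // remove the old file
        fs.rename(outDir, input);   // the new directory takes over the original name
    }
}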
The second strategy has the advantage of not having to handle additional files, but it has to deal with the following issues:
- Unfitted Partitions: The size of the partitions created by the partitioning algorithm may differ from the predefined chunk size, even if we set strict imbalance constraints in the algorithm. To approximate the chunk limits to the end of the temporary files when written one after the other, we can