Algorithm 1. Metadata combination
Data: F: input file; W: set of jobs composing the workload
Result: H = (H_V, H_E): hypergraph modeling the workload
begin
    H_E ← ∅; H_V ← ∅
    foreach job ∈ W do
        T ← ∅; K ← ∅
        foreach m_i ∈ M_job do
            md_i ← getMetadata(m_i)
            if F = getFile(md_i) then
                foreach ⟨k, {rc_1, ..., rc_n}⟩ ∈ md_i do
                    {t_1.id, ..., t_n.id} ← generateTupleID(c_i, {rc_1, ..., rc_n})
                    T[k] ← T[k] ∪ {t_1.id, ..., t_n.id}; K ← K ∪ {k}
        foreach intermediate key k ∈ K do
            H_V ← H_V ∪ T[k]; H_E ← H_E ∪ {T[k]}
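To make the data structures concrete, the following Java sketch mirrors the core of Algorithm 1: tuple identifiers are grouped by intermediate key, so that each key's tuple set becomes a hyperedge of H_E and the union of these sets forms the vertex set H_V. The class, the metadata layout passed to addMetadataEntry, and the tuple-ID encoding are illustrative assumptions rather than the system's actual interfaces.

import java.util.*;

/** Minimal sketch of Algorithm 1: group tuple IDs by intermediate key.
 *  The metadata layout (key, chunk id, record offsets) and the ID encoding
 *  are assumptions made for illustration. */
class HypergraphBuilder {
    // One hyperedge per intermediate key: the set of tuple IDs producing it (T[k]).
    private final Map<String, Set<Long>> edgesByKey = new HashMap<>();
    private final Set<Long> vertices = new HashSet<>();              // H_V

    /** Invoked for every metadata entry <k, {rc_1, ..., rc_n}> of a map task m_i
     *  whose metadata refers to the input file F. */
    void addMetadataEntry(String key, long chunkId, List<Long> recordOffsets) {
        Set<Long> edge = edgesByKey.computeIfAbsent(key, x -> new HashSet<>());
        for (long rc : recordOffsets) {
            long tupleId = generateTupleId(chunkId, rc);              // generateTupleID(c_i, rc_j)
            edge.add(tupleId);                                        // T[k] <- T[k] U {t_j.id}
            vertices.add(tupleId);                                    // H_V <- H_V U T[k]
        }
    }

    /** Hypothetical tuple-ID scheme: chunk id combined with the record's
     *  relative position inside that chunk. */
    private long generateTupleId(long chunkId, long relativeOffset) {
        return (chunkId << 32) | relativeOffset;
    }

    Collection<Set<Long>> hyperedges() { return edgesByKey.values(); } // H_E
    Set<Long> vertexSet()              { return vertices; }            // H_V
}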
3.2 Repartitioning
Once we have modeled the workload of each input file through a hypergraph,
we apply a min-cut k-way graph partitioning algorithm. The algorithm takes as
input a value k and a hypergraph, and produces k disjoint subsets of vertices
minimizing the sum of the weights of the edges between vertices of different
subsets. Weights can be associated with vertices, for instance to represent different
sizes. We set k as the number of chunks in the input file. By using the min-cut
algorithm, the tuples that are used for generating the same intermediate key are
usually assigned to the same partition.
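To hand the hypergraph to an off-the-shelf min-cut k-way partitioner, it can be serialized in a standard input format. The sketch below writes the plain hMETIS .hgr format (first line: number of hyperedges and number of vertices; one subsequent line per hyperedge listing its vertices as 1-based indices), reusing the hyperedge sets built above. The text does not name a specific partitioning tool, so this format choice is only an assumption.

import java.io.*;
import java.util.*;

/** Sketch: dump the workload hypergraph in the plain hMETIS .hgr format so a
 *  min-cut k-way hypergraph partitioner can be run on it with k set to the
 *  number of chunks of the input file. */
class HypergraphWriter {
    static void writeHgr(Collection<Set<Long>> hyperedges, File out) throws IOException {
        // Map tuple IDs to consecutive 1-based vertex indices, as .hgr expects.
        Map<Long, Integer> index = new LinkedHashMap<>();
        for (Set<Long> edge : hyperedges)
            for (long v : edge) index.putIfAbsent(v, index.size() + 1);

        try (PrintWriter w = new PrintWriter(new FileWriter(out))) {
            w.println(hyperedges.size() + " " + index.size());   // |H_E| |H_V|
            for (Set<Long> edge : hyperedges) {                  // one line per hyperedge
                StringBuilder line = new StringBuilder();
                for (long v : edge) line.append(index.get(v)).append(' ');
                w.println(line.toString().trim());
            }
        }
    }
}

A partitioner run on such a file typically returns one part number per vertex, which is the tuple-to-partition assignment used in the repartitioning step described next.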
The output of the algorithm indicates the set of tuples that has to be assigned to each chunk of the input file. The input file should then be repartitioned according to the produced assignment. However, the repartitioning cannot be done in a straightforward manner, mainly because HDFS creates chunks automatically as new data is appended to a file. We create a set of temporary files, one for each partition; we then read the original file and, for each tuple, the output of the graph algorithm indicates to which temporary file the tuple should be copied. Two strategies are then possible: 1) create a set of files in one directory, one per partition, as is done in the reduce phase of MapReduce executions, or 2) write the generated files sequentially into a single file. In both cases, at the end of the process, we remove the old file and rename the new file or directory to the original name. The first strategy is straightforward, and instead of writing the data to temporary files, it can be written directly to HDFS, as the sketch below illustrates.
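The following Java sketch illustrates the first strategy under stated assumptions: tuples are newline-delimited, the partitioner's output is available as a map from tuple position to assigned partition, and the standard HDFS FileSystem API is used to write one file per partition before swapping the result in under the original name. The class name, paths, and defaulting behavior are illustrative, not prescribed by the text above.

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Map;

/** Sketch of the repartitioning pass (strategy 1): one file per partition in a
 *  directory, followed by delete + rename so the new data keeps the old name.
 *  Tuple format and partition-assignment representation are assumptions. */
public class Repartitioner {
    public static void repartition(FileSystem fs, Path input, int k,
                                   Map<Long, Integer> partitionOf) throws IOException {
        Path outDir = new Path(input + ".repart");
        fs.mkdirs(outDir);

        // One output file per partition, written directly in HDFS.
        FSDataOutputStream[] out = new FSDataOutputStream[k];
        for (int p = 0; p < k; p++)
            out[p] = fs.create(new Path(outDir, String.format("part-%05d", p)));

        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(input)))) {
            String tuple;
            long tupleId = 0;
            while ((tuple = in.readLine()) != null) {
                // The partitioner's output says where this tuple belongs;
                // tuples without an assignment default to partition 0 (an assumption).
                int p = partitionOf.getOrDefault(tupleId++, 0);
                out[p].writeBytes(tuple + "\n");
            }
        }
        for (FSDataOutputStream o : out) o.close();

        fs.delete(input, false);    // remove the old file
        fs.rename(outDir, input);   // the new directory takes over the original name
    }
}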
The second strategy has the advantage of not having to handle additional files, but it has to deal with the following issues:
- Unfitted Partitions: The size of the partitions created by the partitioning algorithm may differ from the predefined chunk size, even if we set strict imbalance constraints in the algorithm. To approximate the chunk limits to the end of the temporary files when written one after the other, we can