failure because their output is stored on the local disk(s) of the failed machine and is
therefore inaccessible. Completed reduce tasks do not need to be re-executed since
their output is stored in a global file system.
9.3 Extensions and Enhancements of the MapReduce Framework
In practice, the basic implementation of MapReduce is very useful for handling
data processing and data loading in a heterogeneous system with many different
storage systems. Moreover, it provides a flexible framework for executing
more complicated functions than those that can be directly supported in SQL. However,
this basic architecture suffers from some limitations. Dean and Ghemawat [120]
reported several possible improvements that can be incorporated into the
MapReduce framework. Examples of these possible improvements include:
• MapReduce should take advantage of natural indices whenever possible.
• Most MapReduce output can be left unmerged, since there is no benefit in
merging it if the next consumer is just another MapReduce program.
• MapReduce users should avoid using inefficient textual formats (a sketch
illustrating this point follows the list).
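The last point is mainly a matter of job configuration: when the output of one
MapReduce job is consumed only by another MapReduce job, a compact binary format
avoids repeated parsing and the size overhead of text records. The following
Hadoop (Java) driver is a minimal sketch, not taken from [120]; the class name,
job name, and command-line paths are illustrative. It runs an identity
pass-through job but writes its output as a block-compressed SequenceFile
instead of plain text.

// Minimal, hypothetical Hadoop driver: an identity pass-through job whose output
// is written as a block-compressed SequenceFile (binary) rather than plain text.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class BinaryOutputJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "binary-output-sketch");
    job.setJarByClass(BinaryOutputJob.class);

    // No mapper/reducer set: the identity defaults simply pass records through.
    // With the default TextInputFormat, keys are byte offsets and values are lines.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Write compact binary records instead of delimited text lines.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A downstream job can read such output directly with SequenceFileInputFormat, so
no text parsing is needed between the two jobs, and block compression reduces
the volume of intermediate data that has to be stored.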
In the following subsections we discuss some research efforts that have been
conducted to deal with these challenges, as well as the different improvements
that have been made to the basic implementation of the MapReduce framework in
order to achieve these goals.
Processing Join Operations
One main limitation of the MapReduce framework is that it does not support the
joining of multiple datasets in one task. However, this can still be achieved with
additional MapReduce steps. For example, users can map and reduce one dataset
and read data from other datasets on the fly. Blanas et al. [82] reported
a study that evaluated the performance of different distributed join algorithms
using the MapReduce framework. In particular, they evaluated the following
implementation strategies of distributed join algorithms:
• Standard repartition join: The two input relations are dynamically partitioned
on the join key, and the corresponding pairs of partitions are joined using the
standard partitioned sort-merge join approach (a minimal sketch follows at the
end of this section).
• Improved repartition join: One potential problem with the standard repartition
join is that all the records for a given join key, from both input relations,
have to be buffered. Therefore, when the key cardinality is small or when the
data is highly skewed, all the records for a given join key may not fit in memory.
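To make the repartition strategy concrete, the following Java code is a minimal
sketch of a reduce-side (repartition) join on Hadoop, not the implementation
evaluated by Blanas et al. [82]. The class names are hypothetical; records are
assumed to be tab-separated with the join key in the first field, and the source
relation is inferred from the input file name (files of relation R are assumed
to start with "R"). The reducer buffers the records of both relations for each
join key, which makes the memory pressure described in the second bullet
directly visible in code.

// Hypothetical sketch of the standard repartition (reduce-side) join.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RepartitionJoin {

  /** Tags each record with its source relation and emits it under the join key. */
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      if (fields.length < 2) return;                       // skip malformed records
      // Assumed convention: files of relation R are named "R*", everything else is S.
      String fileName = ((FileSplit) ctx.getInputSplit()).getPath().getName();
      String tag = fileName.startsWith("R") ? "R" : "S";
      ctx.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
    }
  }

  /** Buffers the records of both relations for a key, then forms their cross product. */
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> rRecords = new ArrayList<>();  // all R records for this key
      List<String> sRecords = new ArrayList<>();  // all S records for this key
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if ("R".equals(parts[0])) rRecords.add(parts[1]);
        else sRecords.add(parts[1]);
      }
      // Buffering both sides is the memory bottleneck of the standard variant.
      for (String r : rRecords) {
        for (String s : sRecords) {
          ctx.write(key, new Text(r + "\t" + s));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "repartition-join-sketch");
    job.setJarByClass(RepartitionJoin.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // directory holding R* and S* files
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In a real deployment one would typically distinguish the two relations with
MultipleInputs and separate mapper classes rather than by file-name convention;
the structure of the join itself stays the same.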