failure because their output is stored on the local disk(s) of the failed machine and is
therefore inaccessible. Completed reduce tasks do not need to be re-executed since
their output is stored in a global file system.
9.3 Extensions and Enhancements of the MapReduce Framework
In practice, the basic implementation of MapReduce is very useful for handling
data processing and data loading in a heterogeneous system with many different
storage systems. Moreover, it provides a flexible framework for executing
more complicated functions than those that can be directly supported in SQL. However,
this basic architecture suffers from some limitations. Dean and Ghemawat [120]
reported several possible improvements that can be incorporated into the
MapReduce framework. Examples of these possible improvements include:
• MapReduce should take advantage of natural indices whenever possible.
• Most MapReduce output can be left unmerged, since there is no benefit in
merging it if the next consumer is just another MapReduce program.
• MapReduce users should avoid using inefficient textual formats (a sketch
illustrating this point follows the list).
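The last point is mainly a matter of job configuration: when the output of one
MapReduce job is consumed only by another MapReduce job, a compact binary format
avoids repeated parsing and the size overhead of text records. The following
Hadoop (Java) driver is a minimal sketch, not taken from [120]; the class name,
job name, and command-line paths are illustrative. It runs an identity
pass-through job but writes its output as a block-compressed SequenceFile
instead of plain text.

// Minimal, hypothetical Hadoop driver: an identity pass-through job whose output
// is written as a block-compressed SequenceFile (binary) rather than plain text.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class BinaryOutputJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "binary-output-sketch");
    job.setJarByClass(BinaryOutputJob.class);

    // No mapper/reducer set: the identity defaults simply pass records through.
    // With the default TextInputFormat, keys are byte offsets and values are lines.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Write compact binary records instead of delimited text lines.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A downstream job can read such output directly with SequenceFileInputFormat, so
no text parsing is needed between the two jobs, and block compression reduces
the volume of intermediate data that has to be stored.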
In the following subsections we discuss some research efforts that have been
conducted to deal with these challenges, as well as the different improvements
that have been made to the basic implementation of the MapReduce framework in
order to achieve these goals.
Processing Join Operations
One main limitation of the MapReduce framework is that it does not support the
joining of multiple datasets in one task. However, this can still be achieved with
additional MapReduce steps. For example, users can map and reduce one dataset
and read data from other datasets on the fly. Blanas et al. [82] reported
a study that evaluated the performance of different distributed join algorithms
using the MapReduce framework. In particular, they evaluated the following
implementation strategies of distributed join algorithms:
• Standard repartition join: The two input relations are dynamically partitioned
on the join key, and the corresponding pairs of partitions are joined using the
standard partitioned sort-merge join approach (a minimal sketch follows at the
end of this section).
• Improved repartition join: One potential problem with the standard repartition
join is that all the records for a given join key, from both input relations,
have to be buffered. Therefore, when the key cardinality is small or when the
data is highly skewed, all the records for a given join key may not fit in memory.
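To make the repartition strategy concrete, the following Java code is a minimal
sketch of a reduce-side (repartition) join on Hadoop, not the implementation
evaluated by Blanas et al. [82]. The class names are hypothetical; records are
assumed to be tab-separated with the join key in the first field, and the source
relation is inferred from the input file name (files of relation R are assumed
to start with "R"). The reducer buffers the records of both relations for each
join key, which makes the memory pressure described in the second bullet
directly visible in code.

// Hypothetical sketch of the standard repartition (reduce-side) join.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RepartitionJoin {

  /** Tags each record with its source relation and emits it under the join key. */
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      if (fields.length < 2) return;                       // skip malformed records
      // Assumed convention: files of relation R are named "R*", everything else is S.
      String fileName = ((FileSplit) ctx.getInputSplit()).getPath().getName();
      String tag = fileName.startsWith("R") ? "R" : "S";
      ctx.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
    }
  }

  /** Buffers the records of both relations for a key, then forms their cross product. */
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> rRecords = new ArrayList<>();  // all R records for this key
      List<String> sRecords = new ArrayList<>();  // all S records for this key
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if ("R".equals(parts[0])) rRecords.add(parts[1]);
        else sRecords.add(parts[1]);
      }
      // Buffering both sides is the memory bottleneck of the standard variant.
      for (String r : rRecords) {
        for (String s : sRecords) {
          ctx.write(key, new Text(r + "\t" + s));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "repartition-join-sketch");
    job.setJarByClass(RepartitionJoin.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // directory holding R* and S* files
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In a real deployment one would typically distinguish the two relations with
MultipleInputs and separate mapper classes rather than by file-name convention;
the structure of the join itself stays the same.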