superior to the conventional way of using map-reduce to implement joins. However,
there are some cases where the proposed approach results in clear wins such as:
• Analytic queries in which a very large fact table is joined with smaller dimension tables.
• Queries involving paths through graphs with high out-degree, such as the Web or a social network.
Li et al. [175] have proposed a MapReduce-based data analysis platform that is geared towards incremental one-pass analytics. In particular, they replace the sort-merge implementation in the standard MapReduce framework with a purely hash-based framework, designed to address the computational and I/O bottlenecks as well as the blocking behavior of the sort-merge algorithm. To this end, they devised two hash techniques that suit different user reduce functions, depending on whether the reduce function permits incremental processing. Besides eliminating the sorting cost from the map tasks, these hash techniques enable fast in-memory processing of the reduce function whenever the available memory is sufficient for the workload and algorithm. In addition, to bring the benefits of fast in-memory processing to workloads whose key-state space far exceeds the available memory, they presented a technique that identifies frequent keys and updates their states on a full in-memory processing path, which saves I/O and also enables early answers for these keys.
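The core idea behind such a hash-based alternative to sort-merge can be illustrated, in simplified form, by folding each arriving (key, value) pair into a per-key state held in a hash table, instead of buffering and sorting all pairs before reducing them. The following sketch is illustrative only (the class and method names are ours, not Li et al.'s) and assumes the reduce function is an incrementally computable aggregate such as a sum:

import java.util.HashMap;
import java.util.Map;

/**
 * Simplified illustration of hash-based, incremental reduce-side processing:
 * rather than sorting all (key, value) pairs and then reducing each group,
 * each pair is folded into a per-key running state as soon as it arrives.
 * Assumes the user's reduce function is incremental (here: a running sum).
 */
public class IncrementalHashReduce {

    // Per-key state kept in memory. For key spaces larger than memory, the
    // technique described above keeps only the frequent ("hot") keys on the
    // in-memory path and spills the rest, which this sketch does not show.
    private final Map<String, Long> state = new HashMap<>();

    // Called once per incoming (key, value) pair, e.g. as map output arrives.
    public void accept(String key, long value) {
        state.merge(key, value, Long::sum);
    }

    // Early answer for a single key, available before the input is exhausted.
    public Long currentAnswer(String key) {
        return state.get(key);
    }

    public static void main(String[] args) {
        IncrementalHashReduce r = new IncrementalHashReduce();
        r.accept("page-A", 3);
        r.accept("page-B", 1);
        r.accept("page-A", 2);
        // Prints 5: the state for "page-A" is updated in place, no sort needed.
        System.out.println(r.currentAnswer("page-A"));
    }
}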
Supporting Iterative Processing
Many data analysis techniques (e.g., the PageRank algorithm, recursive relational queries, social network analysis) require iterative computations. These techniques share a common requirement: data are processed iteratively until the computation satisfies a convergence or stopping condition. The basic MapReduce framework does not directly support such iterative data analysis applications. Instead, programmers must implement iterative programs by manually issuing multiple MapReduce jobs and orchestrating their execution with a driver program (a sketch of such a driver follows the list below). In practice, there are two key problems with manually orchestrating an iterative program in MapReduce:
• Even though much of the data may be unchanged from iteration to iteration, the data must be re-loaded and re-processed at each iteration, wasting I/O, network bandwidth and CPU resources.
• The termination condition may involve detecting when a fixpoint has been reached. This check may itself require an extra MapReduce job on each iteration, again incurring overhead in terms of scheduling extra tasks, reading extra data from disk and moving data across the network.
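To make this orchestration concrete, the following sketch shows the conventional pattern of driving an iterative computation with the plain Hadoop MapReduce API: the driver submits one job per iteration, feeds each iteration's output to the next iteration as input, and checks a user-defined counter to detect a fixpoint. The IterMapper/IterReducer placeholders and the "CONVERGENCE"/"CHANGED" counter are hypothetical application-level names, not Hadoop built-ins, and the whole example is a minimal sketch of the pattern rather than any particular system discussed above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Conventional driver for an iterative MapReduce computation: one job is
 * submitted per iteration, the output of iteration i becomes the input of
 * iteration i+1, and a user-defined counter is used to detect a fixpoint.
 */
public class IterativeDriver {

    // Minimal placeholder mapper/reducer so the sketch compiles on its own;
    // a real application puts its per-iteration logic here and has the
    // reducer increment the CONVERGENCE:CHANGED counter whenever a value
    // changes relative to the previous iteration.
    public static class IterMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> { }
    public static class IterReducer
            extends Reducer<LongWritable, Text, LongWritable, Text> { }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        int iteration = 0;
        boolean converged = false;

        while (!converged && iteration < 50) {        // safety cap on iterations
            Path output = new Path(args[1] + "/iter-" + iteration);

            Job job = Job.getInstance(conf, "iteration-" + iteration);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(IterMapper.class);
            job.setReducerClass(IterReducer.class);
            FileInputFormat.addInputPath(job, input);  // re-reads all data each pass
            FileOutputFormat.setOutputPath(job, output);

            if (!job.waitForCompletion(true)) {
                throw new RuntimeException("Iteration " + iteration + " failed");
            }

            // Fixpoint test: zero changed values means the computation converged.
            long changed = job.getCounters()
                              .findCounter("CONVERGENCE", "CHANGED").getValue();
            converged = (changed == 0);

            input = output;    // next iteration consumes this iteration's output
            iteration++;
        }
    }
}

Note that the sketch exhibits exactly the two problems listed above: every iteration re-reads its entire input from the distributed file system, and the convergence test relies on extra bookkeeping (here a counter; in other formulations a separate comparison job) layered on top of the per-iteration work.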