superior to the conventional way of using map-reduce to implement joins. However,
there are some cases where the proposed approach results in clear wins such as:
• Analytic queries in which a very large fact table is joined with smaller dimension tables.
• Queries involving paths through graphs with high out-degree, such as the Web or a social network.
Li et al. [175] have proposed a MapReduce-based data analysis platform that is geared towards incremental one-pass analytics. In particular, they replace the sort-merge implementation in the standard MapReduce framework with a purely hash-based framework, designed to address the computational and I/O bottlenecks as well as the blocking behavior of the sort-merge algorithm. To this end, they devised two hash techniques that suit different user reduce functions, depending on whether the reduce function permits incremental processing. Besides eliminating the sorting cost from the map tasks, these hash techniques enable fast in-memory processing of the reduce function whenever the available memory is sufficient for the workload and algorithm. In addition, to bring the benefits of fast in-memory processing to workloads whose key-state space far exceeds the available memory, they presented a technique that identifies frequent keys and updates their states on a full in-memory processing path, which saves I/O and also enables early answers for these keys.
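The core idea behind such a hash-based alternative to sort-merge can be illustrated, in simplified form, by folding each arriving (key, value) pair into a per-key state held in a hash table, instead of buffering and sorting all pairs before reducing them. The following sketch is illustrative only (the class and method names are ours, not Li et al.'s) and assumes the reduce function is an incrementally computable aggregate such as a sum:

import java.util.HashMap;
import java.util.Map;

/**
 * Simplified illustration of hash-based, incremental reduce-side processing:
 * rather than sorting all (key, value) pairs and then reducing each group,
 * each pair is folded into a per-key running state as soon as it arrives.
 * Assumes the user's reduce function is incremental (here: a running sum).
 */
public class IncrementalHashReduce {

    // Per-key state kept in memory. For key spaces larger than memory, the
    // technique described above keeps only the frequent ("hot") keys on the
    // in-memory path and spills the rest, which this sketch does not show.
    private final Map<String, Long> state = new HashMap<>();

    // Called once per incoming (key, value) pair, e.g. as map output arrives.
    public void accept(String key, long value) {
        state.merge(key, value, Long::sum);
    }

    // Early answer for a single key, available before the input is exhausted.
    public Long currentAnswer(String key) {
        return state.get(key);
    }

    public static void main(String[] args) {
        IncrementalHashReduce r = new IncrementalHashReduce();
        r.accept("page-A", 3);
        r.accept("page-B", 1);
        r.accept("page-A", 2);
        // Prints 5: the state for "page-A" is updated in place, no sort needed.
        System.out.println(r.currentAnswer("page-A"));
    }
}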
Supporting Iterative Processing
Many data analysis techniques (e.g., the PageRank algorithm, recursive relational queries, social network analysis) require iterative computations. These techniques share a common requirement: data are processed iteratively until the computation satisfies a convergence or stopping condition. The basic MapReduce framework does not directly support such iterative data analysis applications. Instead, programmers must implement iterative programs by manually issuing multiple MapReduce jobs and orchestrating their execution with a driver program (a sketch of such a driver follows the list below). In practice, there are two key problems with manually orchestrating an iterative program in MapReduce:
• Even though much of the data may be unchanged from iteration to iteration, the data must be re-loaded and re-processed at each iteration, wasting I/O, network bandwidth and CPU resources.
• The termination condition may involve detecting when a fixpoint has been reached. This check may itself require an extra MapReduce job on each iteration, again incurring overhead in terms of scheduling extra tasks, reading extra data from disk and moving data across the network.
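To make this orchestration concrete, the following sketch shows the conventional pattern of driving an iterative computation with the plain Hadoop MapReduce API: the driver submits one job per iteration, feeds each iteration's output to the next iteration as input, and checks a user-defined counter to detect a fixpoint. The IterMapper/IterReducer placeholders and the "CONVERGENCE"/"CHANGED" counter are hypothetical application-level names, not Hadoop built-ins, and the whole example is a minimal sketch of the pattern rather than any particular system discussed above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Conventional driver for an iterative MapReduce computation: one job is
 * submitted per iteration, the output of iteration i becomes the input of
 * iteration i+1, and a user-defined counter is used to detect a fixpoint.
 */
public class IterativeDriver {

    // Minimal placeholder mapper/reducer so the sketch compiles on its own;
    // a real application puts its per-iteration logic here and has the
    // reducer increment the CONVERGENCE:CHANGED counter whenever a value
    // changes relative to the previous iteration.
    public static class IterMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> { }
    public static class IterReducer
            extends Reducer<LongWritable, Text, LongWritable, Text> { }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        int iteration = 0;
        boolean converged = false;

        while (!converged && iteration < 50) {        // safety cap on iterations
            Path output = new Path(args[1] + "/iter-" + iteration);

            Job job = Job.getInstance(conf, "iteration-" + iteration);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(IterMapper.class);
            job.setReducerClass(IterReducer.class);
            FileInputFormat.addInputPath(job, input);  // re-reads all data each pass
            FileOutputFormat.setOutputPath(job, output);

            if (!job.waitForCompletion(true)) {
                throw new RuntimeException("Iteration " + iteration + " failed");
            }

            // Fixpoint test: zero changed values means the computation converged.
            long changed = job.getCounters()
                              .findCounter("CONVERGENCE", "CHANGED").getValue();
            converged = (changed == 0);

            input = output;    // next iteration consumes this iteration's output
            iteration++;
        }
    }
}

Note that the sketch exhibits exactly the two problems listed above: every iteration re-reads its entire input from the distributed file system, and the convergence test relies on extra bookkeeping (here a counter; in other formulations a separate comparison job) layered on top of the per-iteration work.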