Afrati and Ullman [6,7] have presented another approach to improving the join
phase in the MapReduce framework. The approach aims to optimize communication
cost by selecting the most appropriate attributes for partitioning and replicating
data among the reduce processes. It begins by identifying the map-key, the set of
attributes that identifies the reduce process to which a map process must send a
particular tuple. Each attribute of the map-key gets a "share," which is the number
of buckets into which its values are hashed; together, the shares form the components
of a reduce process's identifier. Each relation's tuples are replicated in a limited
fashion, where the degree of replication depends on the shares of those map-key
attributes that are missing from the relation's schema. The approach considers two
important special join cases: chain joins (a sequence of two-way join operations in
which the output of one join is pipelined as input to the next) and star joins (the
join of a large fact table with several smaller dimension tables). In each case, the
proposed algorithm determines the map-key and the shares that yield the least
replication. The proposed approach is not always superior to the conventional way
of implementing joins in MapReduce. However, there are cases where it results in
clear wins, such as:
- Analytic queries in which a very large fact table is joined with smaller
  dimension tables
- Queries involving paths through graphs with high out-degree, such as the
  Web or a social network
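To illustrate the share-based idea, consider the chain join R(A,B) |><| S(B,C) |><| T(C,D), whose map-key is (B, C). The sketch below is a minimal illustration, not Afrati and Ullman's actual implementation; the share values B_SHARE and C_SHARE are arbitrary illustrative choices rather than values produced by the paper's optimization. A tuple of S contains the full map-key and goes to exactly one reducer, while tuples of R and T are each missing one map-key attribute and must be replicated across that attribute's share.

```python
# Illustrative shares: reducers form a B_SHARE x C_SHARE grid.
B_SHARE = 4   # buckets for attribute B (assumed value)
C_SHARE = 3   # buckets for attribute C (assumed value)

def bucket(value, share):
    return hash(value) % share

def reducers_for_R(a, b):
    # R(A, B) is missing C, so the tuple is replicated to all C_SHARE
    # reducers that agree on B's bucket.
    i = bucket(b, B_SHARE)
    return [(i, j) for j in range(C_SHARE)]

def reducers_for_S(b, c):
    # S(B, C) contains the whole map-key: exactly one reducer, no replication.
    return [(bucket(b, B_SHARE), bucket(c, C_SHARE))]

def reducers_for_T(c, d):
    # T(C, D) is missing B, so the tuple is replicated across all B_SHARE
    # buckets that agree on C's bucket.
    j = bucket(c, C_SHARE)
    return [(i, j) for i in range(B_SHARE)]
```

Choosing larger shares for high-selectivity attributes shrinks the replication of the relations that lack them, which is exactly the trade-off the algorithm optimizes.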
Li et al. [90] have proposed a MapReduce-based data analysis platform geared
toward incremental one-pass analytics. In particular, they replace the sort-merge
implementation of the standard MapReduce framework with a purely hash-based
framework, designed to address the computational and I/O bottlenecks as well as the
blocking behavior of the sort-merge algorithm. They devise two hash techniques to
suit different user reduce functions, depending on whether the reduce function
permits incremental processing. Besides eliminating the sorting cost from the map
tasks, these hash techniques enable fast in-memory processing of the reduce function
whenever memory is sufficient for the workload and algorithm. In addition, to bring
the benefits of fast in-memory processing to workloads whose key-state space far
exceeds available memory, they present a technique that identifies frequent keys and
updates their states along a fully in-memory processing path, both saving I/O and
enabling early answers for these keys.
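The following is a minimal sketch, not Li et al.'s implementation, of the idea behind the hash technique for reduce functions that permit incremental processing: rather than sorting and merging map output, per-key state is kept in a hash table and updated as each (key, value) pair arrives, so no blocking sort phase is needed. The word-count aggregation is an illustrative incremental reducer.

```python
from collections import defaultdict

def incremental_reduce(pairs, init, update):
    """Consume an unsorted stream of (key, value) pairs in a single pass."""
    state = defaultdict(init)                   # in-memory per-key state
    for key, value in pairs:
        state[key] = update(state[key], value)  # early answers available here
    return dict(state)

# Usage: counting words without any sort phase.
pairs = [("a", 1), ("b", 1), ("a", 1)]
counts = incremental_reduce(pairs, init=int, update=lambda s, v: s + v)
assert counts == {"a": 2, "b": 1}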
2.3.2 Supporting Iterative Processing
Many data analysis techniques (e.g., the PageRank algorithm, recursive relational
queries, social network analysis) require iterative computations. These techniques
share a common requirement: data are processed iteratively until the computation
satisfies a convergence or stopping condition. The basic MapReduce framework
does not directly support such iterative data analysis applications. Instead, program-
mers must implement iterative programs by manually issuing multiple MapReduce
jobs and orchestrating their execution with a driver program.
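The sketch below illustrates this manual driver-loop pattern with a self-contained, in-memory stand-in for a MapReduce job: each iteration performs one map, shuffle, and reduce round of a simplified PageRank update on a toy graph, and the loop stops when the ranks change by less than a tolerance. The graph, damping factor, and tolerance are illustrative assumptions, not part of any particular framework.

```python
from collections import defaultdict

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # toy adjacency list
ranks = {node: 1.0 / len(graph) for node in graph}

def one_iteration(ranks):
    """One 'MapReduce job': map emits rank shares, reduce sums them."""
    shuffled = defaultdict(list)
    for node, neighbors in graph.items():            # map phase
        for dst in neighbors:
            shuffled[dst].append(ranks[node] / len(neighbors))
    return {node: 0.15 / len(graph) + 0.85 * sum(shares)   # reduce phase
            for node, shares in shuffled.items()}

# Driver loop: manually issue jobs until the stopping condition holds.
for _ in range(50):
    new_ranks = one_iteration(ranks)
    if max(abs(new_ranks[n] - ranks[n]) for n in graph) < 1e-6:
        break
    ranks = new_ranks
```

In a real deployment each iteration would be a full MapReduce job whose output is materialized and re-read as the next job's input, which is exactly the overhead the iterative-processing extensions discussed in this section aim to remove.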