Afrati and Ullman [6,7] have presented another approach to improving the join
phase in the MapReduce framework. The approach aims to optimize communication
cost by selecting the most appropriate attributes for partitioning and replicating
data among the reduce processes. It begins by identifying the map-key, the set of
attributes that identifies the reduce process to which a map process must send a
particular tuple. Each attribute of the map-key gets a "share," which is the number
of buckets into which its values are hashed; together, the shares form the components
of a reduce process's identifier. Each relation's tuples are replicated in a limited
fashion, where the degree of replication depends on the shares of those map-key
attributes that are missing from the relation's schema. The approach considers two
important special join cases: chain joins (a sequence of two-way join operations in
which the output of one join is pipelined as input to the next) and star joins (the
join of a large fact table with several smaller dimension tables). In each case, the
proposed algorithm determines the map-key and the shares that yield the least
replication. The proposed approach is not always superior to the conventional way
of implementing joins in MapReduce. However, there are cases where it results in
clear wins, such as:
- Analytic queries in which a very large fact table is joined with smaller
  dimension tables
- Queries involving paths through graphs with high out-degree, such as the
  Web or a social network
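To illustrate the share-based idea, consider the chain join R(A,B) |><| S(B,C) |><| T(C,D), whose map-key is (B, C). The sketch below is a minimal illustration, not Afrati and Ullman's actual implementation; the share values B_SHARE and C_SHARE are arbitrary illustrative choices rather than values produced by the paper's optimization. A tuple of S contains the full map-key and goes to exactly one reducer, while tuples of R and T are each missing one map-key attribute and must be replicated across that attribute's share.

```python
# Illustrative shares: reducers form a B_SHARE x C_SHARE grid.
B_SHARE = 4   # buckets for attribute B (assumed value)
C_SHARE = 3   # buckets for attribute C (assumed value)

def bucket(value, share):
    return hash(value) % share

def reducers_for_R(a, b):
    # R(A, B) is missing C, so the tuple is replicated to all C_SHARE
    # reducers that agree on B's bucket.
    i = bucket(b, B_SHARE)
    return [(i, j) for j in range(C_SHARE)]

def reducers_for_S(b, c):
    # S(B, C) contains the whole map-key: exactly one reducer, no replication.
    return [(bucket(b, B_SHARE), bucket(c, C_SHARE))]

def reducers_for_T(c, d):
    # T(C, D) is missing B, so the tuple is replicated across all B_SHARE
    # buckets that agree on C's bucket.
    j = bucket(c, C_SHARE)
    return [(i, j) for i in range(B_SHARE)]
```

Choosing larger shares for high-selectivity attributes shrinks the replication of the relations that lack them, which is exactly the trade-off the algorithm optimizes.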
Li et al. [90] have proposed a MapReduce-based data analysis platform geared
toward incremental one-pass analytics. In particular, they replace the sort-merge
implementation of the standard MapReduce framework with a purely hash-based
framework, designed to address the computational and I/O bottlenecks as well as the
blocking behavior of the sort-merge algorithm. They devise two hash techniques to
suit different user reduce functions, depending on whether the reduce function
permits incremental processing. Besides eliminating the sorting cost from the map
tasks, these hash techniques enable fast in-memory processing of the reduce function
whenever memory is sufficient for the workload and algorithm. In addition, to bring
the benefits of fast in-memory processing to workloads whose key-state space far
exceeds available memory, they present a technique that identifies frequent keys and
updates their states along a fully in-memory processing path, both saving I/O and
enabling early answers for these keys.
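The following is a minimal sketch, not Li et al.'s implementation, of the idea behind the hash technique for reduce functions that permit incremental processing: rather than sorting and merging map output, per-key state is kept in a hash table and updated as each (key, value) pair arrives, so no blocking sort phase is needed. The word-count aggregation is an illustrative incremental reducer.

```python
from collections import defaultdict

def incremental_reduce(pairs, init, update):
    """Consume an unsorted stream of (key, value) pairs in a single pass."""
    state = defaultdict(init)                   # in-memory per-key state
    for key, value in pairs:
        state[key] = update(state[key], value)  # early answers available here
    return dict(state)

# Usage: counting words without any sort phase.
pairs = [("a", 1), ("b", 1), ("a", 1)]
counts = incremental_reduce(pairs, init=int, update=lambda s, v: s + v)
assert counts == {"a": 2, "b": 1}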
2.3.2 Supporting Iterative Processing
Many data analysis techniques (e.g., the PageRank algorithm, recursive relational
queries, social network analysis) require iterative computations. These techniques
share a common requirement: data are processed iteratively until the computation
satisfies a convergence or stopping condition. The basic MapReduce framework
does not directly support such iterative data analysis applications. Instead, program-
mers must implement iterative programs by manually issuing multiple MapReduce
jobs and orchestrating their execution with a driver program.
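The sketch below illustrates this manual driver-loop pattern with a self-contained, in-memory stand-in for a MapReduce job: each iteration performs one map, shuffle, and reduce round of a simplified PageRank update on a toy graph, and the loop stops when the ranks change by less than a tolerance. The graph, damping factor, and tolerance are illustrative assumptions, not part of any particular framework.

```python
from collections import defaultdict

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # toy adjacency list
ranks = {node: 1.0 / len(graph) for node in graph}

def one_iteration(ranks):
    """One 'MapReduce job': map emits rank shares, reduce sums them."""
    shuffled = defaultdict(list)
    for node, neighbors in graph.items():            # map phase
        for dst in neighbors:
            shuffled[dst].append(ranks[node] / len(neighbors))
    return {node: 0.15 / len(graph) + 0.85 * sum(shares)   # reduce phase
            for node, shares in shuffled.items()}

# Driver loop: manually issue jobs until the stopping condition holds.
for _ in range(50):
    new_ranks = one_iteration(ranks)
    if max(abs(new_ranks[n] - ranks[n]) for n in graph) < 1e-6:
        break
    ranks = new_ranks
```

In a real deployment each iteration would be a full MapReduce job whose output is materialized and re-read as the next job's input, which is exactly the overhead the iterative-processing extensions discussed in this section aim to remove.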