To put Big Data to use in our daily lives, data scientists have developed a wide range of data
mining and machine learning algorithms to make sense of these data. Many of these
data analysis algorithms require iterative processing. For example, the well-known
PageRank algorithm parses the web linkage graph iteratively to derive ranking
scores for web pages, and the K-means algorithm iteratively refines the cluster
centroids for grouping data points.
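To make the iterative structure concrete, the per-iteration update rules can be written out. In one common formulation (the notation here is ours, with a damping factor $d$, typically 0.85), PageRank repeatedly applies

$$R^{(k+1)}(j) = (1 - d) + d \sum_{i \in \mathrm{In}(j)} \frac{R^{(k)}(i)}{|\mathrm{Out}(i)|},$$

where $\mathrm{In}(j)$ is the set of pages linking to page $j$ and $\mathrm{Out}(i)$ is the set of links out of page $i$. Similarly, K-means alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points,

$$c_m^{(k+1)} = \frac{1}{|S_m^{(k)}|} \sum_{x \in S_m^{(k)}} x,$$

where $S_m^{(k)}$ is the set of points currently closest to centroid $c_m^{(k)}$. In both cases, one iteration's output is the next iteration's input, which is exactly the pattern this chapter addresses.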
Analyzing such massive data sets requires a distributed computing framework running
on top of a cluster of servers. MapReduce is a framework proposed for Big Data
processing in large-scale distributed environments. Since its introduction, MapReduce,
in particular its open-source implementation, Hadoop,* has become extremely popular
for analyzing large data sets. It provides a simple programming model and takes
care of distributed execution, data exchange, and fault tolerance, which enables
programmers with no experience in distributed systems to exploit a large cluster of
commodity machines to perform data-intensive computation.
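To illustrate how simple the programming model is, here is a minimal word-count sketch against the Hadoop Java API. The class and variable names are ours, not from this chapter, and both classes are shown in one listing for brevity:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in an input line.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer: sums the counts shuffled to it for each distinct word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) sum += v.get();
    context.write(key, new IntWritable(sum));
  }
}

The programmer writes only these two functions; the framework handles partitioning the input, shuffling intermediate pairs, and recovering from machine failures.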
However, Hadoop MapReduce is designed for batch-oriented computation, such
as log analysis and text processing, and it lacks built-in support for iterative
processing. In this chapter, we introduce iMapReduce, which extends Hadoop to support
iterative processing. iMapReduce follows the programming paradigm of MapReduce,
so existing Hadoop MapReduce applications can be easily adapted
to iMapReduce and can benefit from the iterative processing support it
provides.
3.1 ITERATIVE ALGORITHMS IN MapReduce
Many data mining algorithms process data iteratively.
We first briefly introduce MapReduce. Then, we provide three examples of iterative
algorithms and describe their MapReduce implementations; through these examples,
we draw some observations about writing iterative algorithms in MapReduce.
Finally, we summarize the performance penalties of iterative computation in Hadoop
MapReduce.
3.1.1 MapReduce Overview
MapReduce is the most popular distributed framework for Big Data processing.
It integrates with a distributed file system (DFS) for scalable and reliable
storage: a MapReduce job reads its input data from DFS and writes its output data
to DFS. On DFS, a big file is divided into multiple blocks, which are distributed
across the cluster, and each block is replicated on several nodes for
fault tolerance. In the Hadoop ecosystem, the Hadoop Distributed File System (HDFS) is an
open-source implementation of DFS.
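As a rough sketch of how an application interacts with DFS, the following uses the Hadoop FileSystem Java API; the file path and the replication setting are illustrative choices of ours, not taken from this chapter:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "3");       // keep three copies of each block
    FileSystem fs = FileSystem.get(conf);   // connect to the configured DFS

    // Write a file; HDFS splits it into blocks behind the scenes.
    try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
      out.writeUTF("hello distributed file system");
    }

    // Read it back; block locations and replica failover are
    // handled transparently by the client library.
    try (FSDataInputStream in = fs.open(new Path("/tmp/example.txt"))) {
      System.out.println(in.readUTF());
    }
  }
}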
MapReduce follows a batch-processing model with three main phases: map,
shuffle, and reduce. The input data on DFS is first divided into several splits, and
each split is expressed as a set of key-value pairs (KVs) for MapReduce processing.
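A driver program ties these phases together: it points the job at its input on DFS, names the map and reduce classes, and designates an output directory on DFS. Below is a minimal driver sketch against the Hadoop API; it reuses the hypothetical word-count classes from the earlier sketch, and the input and output paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);  // map phase
    job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);   // reduce phase, after the shuffle
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input splits on DFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output written to DFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that a single job runs the map, shuffle, and reduce phases exactly once; an iterative algorithm must therefore chain a sequence of such jobs, which is the overhead this chapter examines.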
* http://hadoop.apache.org/.