clusters or the compute capacity upon which they sit. Hence, such services enable third parties to perform their analytical queries on massive data sets with minimal effort and cost by abstracting away the complexity of building and maintaining computer clusters.
The implementation of the basic MapReduce architecture has some limitations, and several research efforts have sought to address them by introducing advancements to the basic architecture that improve its performance. This chapter provides a comprehensive survey of a family of approaches and mechanisms for large-scale data analysis that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. In particular, the remainder of this chapter is organized as follows. Section 2.2
describes the basic architecture of the MapReduce framework. Section 2.3 discusses
several techniques that have been proposed to improve the performance and capabilities of the MapReduce framework from different perspectives. Section 2.4 covers several systems that support a high-level SQL-like interface for the MapReduce framework, while Section 2.5 provides an overview of several research efforts to develop MapReduce-based solutions for data-intensive applications with different data models and different computationally expensive operations. Section 2.6 reviews
several large-scale data-processing systems that resemble some of the ideas of the
MapReduce framework, without sharing its architecture or infrastructure, for differ-
ent purposes and application scenarios. In Section 2.7, we conclude the chapter and
discuss some of the future research directions for implementing the next generation
of MapReduce/Hadoop-like solutions.
2.2 MapReduce FRAMEWORK: BASIC ARCHITECTURE
The MapReduce framework was introduced as a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines [43,44]. In particular, the implementation described in the original paper was designed mainly to achieve high performance on large clusters of commodity PCs. One of the main advantages of
this approach is that it isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. In this model, the computation takes a set of key/value pairs as input and produces a set of key/value pairs as output. The user of the MapReduce framework expresses the computation using two functions: map and reduce. The map function takes an input
pair and produces a set of intermediate key/value pairs. The MapReduce framework
groups together all intermediate values associated with the same intermediate key I
and passes them to the reduce function. The reduce function receives an intermediate key I with its set of values and merges them together. Typically, zero or one output value is produced per reduce invocation. The main advantage of this model is that it allows large computations to be easily parallelized and re-executed, with re-execution serving as the primary mechanism for fault tolerance. Figure 2.1 illustrates an example
MapReduce program expressed in pseudo-code for counting the number of occur-
rences of each word in a collection of documents. In this example, the map function
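The word-count computation this example describes can be sketched as a small, self-contained Python simulation. Note that the `run_mapreduce` driver below is an illustrative stand-in for the framework's shuffle-and-group step, not Hadoop's actual API; the user supplies only the map and reduce functions, mirroring the model described above.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """map: for each word in the document, emit the intermediate pair (word, 1)."""
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """reduce: merge all partial counts for one intermediate key into a single total."""
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Simulate the framework: apply map to every input pair, group the
    intermediate values by key, then apply reduce to each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    output = {}
    for ikey, ivalues in groups.items():
        for okey, ovalue in reduce_fn(ikey, ivalues):
            output[okey] = ovalue
    return output

docs = [("d1", "to be or not to be"), ("d2", "to see or not")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# → {'to': 3, 'be': 2, 'or': 2, 'not': 2, 'see': 1}
```

In a real deployment, the grouping loop is replaced by the framework's distributed shuffle, and each map or reduce invocation may run on a different machine; the user-visible contract, however, is exactly the two functions shown here.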