clusters or the compute capacity upon which they sit. Hence, such services enable third parties to perform their analytical queries on massive data sets with minimal effort and cost by abstracting away the complexity of building and maintaining computer clusters.
The implementation of the basic MapReduce architecture has some limitations, and several research efforts have sought to address them by introducing advancements to the basic architecture that improve its performance. This chapter provides a comprehensive survey of a family of approaches and mechanisms for large-scale data analysis that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. In particular, the remainder of this chapter is organized as follows. Section 2.2
describes the basic architecture of the MapReduce framework. Section 2.3 discusses
several techniques that have been proposed to improve the performance and capabilities of the MapReduce framework from different perspectives. Section 2.4 covers several systems that support a high-level SQL-like interface for the MapReduce framework, while Section 2.5 provides an overview of several research efforts to develop MapReduce-based solutions for data-intensive applications with different data models and different computationally expensive operations. Section 2.6 reviews
several large-scale data-processing systems that resemble some of the ideas of the
MapReduce framework, without sharing its architecture or infrastructure, for differ-
ent purposes and application scenarios. In Section 2.7, we conclude the chapter and
discuss some of the future research directions for implementing the next generation
of MapReduce/Hadoop-like solutions.
2.2 MapReduce FRAMEWORK: BASIC ARCHITECTURE
The MapReduce framework was introduced as a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines [43,44]. In particular, the implementation described in the original paper was designed mainly to achieve high performance on large clusters of commodity PCs. One of the main advantages of
this approach is that it isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. In this model, the computation takes a set of key/value pairs as input and produces a set of key/value pairs as output. The user of the MapReduce framework expresses the computation using two functions: map and reduce. The map function takes an input
pair and produces a set of intermediate key/value pairs. The MapReduce framework
groups together all intermediate values associated with the same intermediate key I
and passes them to the reduce function. The reduce function receives an intermediate key I with its set of values and merges them together. Typically, zero or one output value is produced per reduce invocation. The main advantage of this model is that it allows large computations to be easily parallelized and re-executed, with re-execution serving as the primary mechanism for fault tolerance. Figure 2.1 illustrates an example
MapReduce program expressed in pseudo-code for counting the number of occur-
rences of each word in a collection of documents. In this example, the map function
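The word-count computation this example describes can be sketched as a small, self-contained Python simulation. Note that the `run_mapreduce` driver below is an illustrative stand-in for the framework's shuffle-and-group step, not Hadoop's actual API; the user supplies only the map and reduce functions, mirroring the model described above.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """map: for each word in the document, emit the intermediate pair (word, 1)."""
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """reduce: merge all partial counts for one intermediate key into a single total."""
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Simulate the framework: apply map to every input pair, group the
    intermediate values by key, then apply reduce to each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    output = {}
    for ikey, ivalues in groups.items():
        for okey, ovalue in reduce_fn(ikey, ivalues):
            output[okey] = ovalue
    return output

docs = [("d1", "to be or not to be"), ("d2", "to see or not")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# → {'to': 3, 'be': 2, 'or': 2, 'not': 2, 'see': 1}
```

In a real deployment, the grouping loop is replaced by the framework's distributed shuffle, and each map or reduce invocation may run on a different machine; the user-visible contract, however, is exactly the two functions shown here.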