produced by massive-scale simulations, sensor deployments, high-throughput lab
equipment) [206]. Although parallel database systems [122] serve some of these
data analysis applications (e.g., Teradata [45], SQL Server PDW [36], Vertica [51],
Greenplum [25], ParAccel [40], Netezza [31]), they are expensive, difficult to
administer, and lack fault tolerance for long-running queries [194]. MapReduce [118]
is a framework introduced by Google for programming commodity computer
clusters to perform large-scale data processing in a single pass. The framework is
designed so that a MapReduce cluster can scale to thousands of nodes in a fault-
tolerant manner. One of the main advantages of this framework is its reliance on
a simple yet powerful programming model. In addition, it isolates the application
developer from the complex details of running a distributed program, such as
data distribution, scheduling, and fault tolerance [193].
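The programming model described above can be illustrated with the canonical word-count example: the developer supplies only a map function and a reduce function, while the framework handles grouping and (in a real cluster) distribution, scheduling, and fault tolerance. The sketch below is a minimal single-process simulation, not a distributed implementation; the names `map_fn`, `reduce_fn`, and `mapreduce` are illustrative, not part of any particular MapReduce API.

```python
from collections import defaultdict

def map_fn(line):
    # User-defined map: emit a (word, 1) pair for every word in the record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # User-defined reduce: combine all counts emitted for one word.
    return (key, sum(values))

def mapreduce(records):
    # Map phase: apply map_fn to every input record.
    # The grouping step stands in for the framework's shuffle,
    # which collects all values emitted under the same key.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    # Reduce phase: apply reduce_fn once per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())

counts = mapreduce(["the quick fox", "the lazy dog"])
print(counts)  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In an actual MapReduce cluster the same two user functions would run in parallel on many nodes, with the framework partitioning the input, shuffling intermediate pairs by key, and re-executing failed tasks transparently.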
In principle, the success of many enterprises often relies on their ability to analyze
expansive volumes of data. In general, cost-effective processing of large datasets is
a nontrivial undertaking. Fortunately, MapReduce frameworks and cloud computing
have made it easier than ever for everyone to step into the world of big data.
This technology combination has enabled even small companies to collect and
analyze terabytes of data in order to gain a competitive edge. For example, the
Amazon Elastic Compute Cloud (EC2) [4] is offered as a commodity that can
be purchased and utilized. In addition, Amazon provides Amazon Elastic
MapReduce [6] as an online service to easily and cost-effectively process
vast amounts of data without the need to worry about time-consuming set-up,
management, or tuning of computing clusters, or the compute capacity upon which
they sit. Hence, such services enable third parties to perform analytical queries
on massive datasets with minimum effort and cost by abstracting the complexity
entailed in building and maintaining computer clusters.
The implementation of the basic MapReduce architecture has some limitations.
Therefore, several research efforts have been triggered to tackle these limitations
by introducing advancements to the basic architecture in order to improve
its performance. This chapter provides a comprehensive survey of a family of
approaches and mechanisms for large-scale data analysis that have been
implemented based on the original idea of the MapReduce framework and are
currently gaining a lot of momentum in both the research and industrial communities.
In particular, the remainder of this chapter is organized as follows. Section 9.2
describes the basic architecture of the MapReduce framework. Section 9.3 discusses
several techniques that have been proposed to improve the performance and capabil-
ities of the MapReduce framework from different perspectives. Section 9.4 covers
several systems that support a high-level SQL-like interface to the MapReduce
framework. In Sect. 9.5, we conclude the chapter and discuss some future
research directions for implementing the next generation of MapReduce/Hadoop-
like solutions.