In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data-processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance; a brief illustrative sketch of this model follows this overview. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This chapter provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been built on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we discuss a set of MapReduce-based approaches for processing massive data sets of different data models (e.g., XML, RDF, graphs) and for computationally expensive data-intensive operations. Furthermore, we review several large-scale data-processing systems that resemble some of the ideas of the MapReduce framework but target different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
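To make the programming model concrete, the following is a minimal, single-process sketch of the map/reduce abstraction, written in Python purely for illustration; it is not the API of Hadoop or of Google's MapReduce implementation, and the names wordcount_map, wordcount_reduce, and run_mapreduce are hypothetical. In a real framework, the grouping step below is carried out as a distributed shuffle, and the system also handles data partitioning, scheduling, and fault tolerance, which is precisely the isolation referred to above.

from collections import defaultdict
from typing import Iterable, Tuple

# Illustrative sketch only: a single-process simulation of the map/reduce
# contract, not the API of any real MapReduce framework.

def wordcount_map(doc_id: str, text: str) -> Iterable[Tuple[str, int]]:
    # map: emit one intermediate (key, value) pair per word occurrence
    for word in text.split():
        yield (word.lower(), 1)

def wordcount_reduce(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    # reduce: aggregate all values collected for one intermediate key
    return (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # The framework, not the user, groups intermediate pairs by key;
    # on a cluster this grouping is the distributed shuffle phase.
    groups = defaultdict(list)
    for input_key, input_value in inputs:
        for k, v in map_fn(input_key, input_value):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

if __name__ == "__main__":
    docs = [("d1", "big data needs big clusters"), ("d2", "map and reduce")]
    print(run_mapreduce(docs, wordcount_map, wordcount_reduce))
    # [('and', 1), ('big', 2), ('clusters', 1), ('data', 1),
    #  ('map', 1), ('needs', 1), ('reduce', 1)]

The user supplies only the two functions; everything else in a production system (input splitting, task scheduling, re-execution of failed tasks) is the framework's responsibility.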
2.1 INTRODUCTION
We live in the era of Big Data, where we are witnessing a continuous increase in computational power that produces an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data-processing mechanisms. Powerful telescopes in astronomy, particle accelerators in physics, and genome sequencers in biology are putting massive volumes of data into the hands of scientists. For example, the Large Synoptic Survey Telescope [1] generates on the order of 30 TB of data every day. Many enterprises continuously collect large data sets that record customer interactions, product sales, results from advertising campaigns on the Web, and other types of information. For example, Facebook collects 15 TB of data each day into a petabyte-scale data warehouse [123].
Jim Gray called this shift the “fourth paradigm” [69]; the first three paradigms were experimental, theoretical, and, more recently, computational science. Gray argued that the only way to cope with this paradigm is to develop a new generation of computing tools to manage, visualize, and analyze the data flood. In general, current computer architectures are increasingly imbalanced: the latency gap between multicore CPUs and mechanical hard disks is growing every year, which makes the challenges of data-intensive computing much harder to overcome [17]. Hence, there is a crucial need for a systematic and generic approach to tackle these problems with an architecture that can also scale into the foreseeable future. In response, Gray argued that the new trend should focus on supporting cheaper clusters of computers to manage and process all this data, rather than on having the biggest and fastest single computer.