In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data-processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance; a brief illustrative sketch of this model follows this overview. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This chapter provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been built on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we discuss a set of MapReduce-based approaches for processing massive data sets of different data models (e.g., XML, RDF, graphs) and for computationally expensive data-intensive operations. Furthermore, we review several large-scale data-processing systems that resemble some of the ideas of the MapReduce framework but target different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
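To make the programming model concrete, the following is a minimal, single-process sketch of the map/reduce abstraction, written in Python purely for illustration; it is not the API of Hadoop or of Google's MapReduce implementation, and the names wordcount_map, wordcount_reduce, and run_mapreduce are hypothetical. In a real framework, the grouping step below is carried out as a distributed shuffle, and the system also handles data partitioning, scheduling, and fault tolerance, which is precisely the isolation referred to above.

from collections import defaultdict
from typing import Iterable, Tuple

# Illustrative sketch only: a single-process simulation of the map/reduce
# contract, not the API of any real MapReduce framework.

def wordcount_map(doc_id: str, text: str) -> Iterable[Tuple[str, int]]:
    # map: emit one intermediate (key, value) pair per word occurrence
    for word in text.split():
        yield (word.lower(), 1)

def wordcount_reduce(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    # reduce: aggregate all values collected for one intermediate key
    return (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # The framework, not the user, groups intermediate pairs by key;
    # on a cluster this grouping is the distributed shuffle phase.
    groups = defaultdict(list)
    for input_key, input_value in inputs:
        for k, v in map_fn(input_key, input_value):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

if __name__ == "__main__":
    docs = [("d1", "big data needs big clusters"), ("d2", "map and reduce")]
    print(run_mapreduce(docs, wordcount_map, wordcount_reduce))
    # [('and', 1), ('big', 2), ('clusters', 1), ('data', 1),
    #  ('map', 1), ('needs', 1), ('reduce', 1)]

The user supplies only the two functions; everything else in a production system (input splitting, task scheduling, re-execution of failed tasks) is the framework's responsibility.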
2.1 INTRODUCTION
We live in the era of Big Data, where we are witnessing a continuous increase in computational power that produces an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data-processing mechanisms. Powerful telescopes in astronomy, particle accelerators in physics, and genome sequencers in biology are putting massive volumes of data into the hands of scientists. For example, the Large Synoptic Survey Telescope [1] generates on the order of 30 TB of data every day. Many enterprises continuously collect large data sets that record customer interactions, product sales, results from advertising campaigns on the Web, and other types of information. For example, Facebook collects 15 TB of data each day into a petabyte-scale data warehouse [123].
Jim Gray called this shift the “fourth paradigm” [69]; the first three paradigms were experimental, theoretical, and, more recently, computational science. Gray argued that the only way to cope with this paradigm is to develop a new generation of computing tools to manage, visualize, and analyze the data flood. In general, current computer architectures are increasingly imbalanced: the latency gap between multicore CPUs and mechanical hard disks is growing every year, which makes the challenges of data-intensive computing much harder to overcome [17]. Hence, there is a crucial need for a systematic and generic approach to tackle these problems with an architecture that can also scale into the foreseeable future. In response, Gray argued that the new trend should focus on supporting cheaper clusters of computers to manage and process all this data, rather than on having the biggest and fastest single computer.