express ad hoc queries. In contrast to other systems such as Pig [109] or Hive [123],
it executes queries natively without translating them into MapReduce jobs. In par-
ticular, Dremel is designed to execute many queries that would ordinarily require a
sequence of MapReduce jobs.
2.7 CONCLUSIONS
The database community has always focused on the challenges of Big Data management, although the meaning of "big" has evolved continuously to represent different scales over time [24]. According to IBM, we currently create 2.5 quintillion bytes of data every day. These data come from many different sources and in many formats, including digital pictures, videos, posts to social media sites, intelligent sensors, purchase transaction records, and cell phone GPS signals. This is a new scale of Big Data, which is attracting huge interest from both the industrial and research communities, with the aim of creating the best means to process and analyze these data. In the last decade, the MapReduce framework has emerged as a popular mechanism for harnessing the power of large clusters of computers. It allows programmers to think in a data-centric fashion: they focus on applying transformations to sets of data records, while the details of distributed execution and fault tolerance are managed transparently by the MapReduce framework.
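The data-centric division of labor described above can be illustrated with a minimal, framework-free sketch: the programmer writes only a map function and a reduce function (word count is the canonical example), while the `run_mapreduce` driver below stands in for what a real framework such as Hadoop performs at scale, with partitioning, distributed execution, and fault tolerance. The function names here are illustrative, not part of any actual MapReduce API.

```python
from collections import defaultdict

def map_fn(record):
    # Map phase: emit an intermediate (word, 1) pair for each word in a line.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce phase: sum all counts emitted for a given word.
    return (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # A stand-in for the framework: in a real system, the shuffle that
    # groups intermediate values by key happens across the cluster.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

if __name__ == "__main__":
    lines = ["big data big clusters", "data records"]
    print(run_mapreduce(lines, map_fn, reduce_fn))
    # {'big': 2, 'clusters': 1, 'data': 2, 'records': 1}
```

The programmer's code never mentions machines, partitions, or failures; that separation is precisely what makes the model attractive for large clusters.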
In this chapter, we presented a survey of the MapReduce family of approaches for developing scalable data-processing systems and solutions. Although the MapReduce framework and its open-source implementation, Hadoop, are now considered mature enough to be widely used by academia and industry for building solutions in many application domains, we believe it is unlikely that MapReduce will completely replace database systems, even for data warehousing applications. We expect that the two will coexist and complement each other in different scenarios. We are also convinced that there is still room for optimization and advancement in several directions across the spectrum of the MapReduce framework, which is required to realize the vision of providing large-scale data analysis as a commodity for novice end users. For example, energy efficiency in MapReduce is an important problem that has not yet attracted sufficient attention from the research community. The traditional challenge of debugging large-scale computations on distributed systems has likewise not been given sufficient consideration by the MapReduce research community. The expressive power of the programming model is another area that requires more investigation. We also noticed that the simplicity of the MapReduce programming model raises key challenges in dealing efficiently with complex data models (e.g., nested models, XML and hierarchical models, RDF, and graphs). This limitation has created the need for a next generation of Big Data architectures and systems that can provide the required scale and performance for these domains. For example, Google has created the Dremel system [99], commercialized under the name BigQuery,* to support interactive analysis
* https://developers.google.com/bigquery/.