Big Data Processing Systems - Cloud Data Management


that although the MapReduce framework and its open source implementation,

Hadoop, are now considered sufficiently mature to be widely used for developing

many solutions in academia and industry across different application

domains, it is unlikely that MapReduce will completely replace

database systems, even for data warehousing applications. We expect that they will

always coexist and complement each other in different scenarios. We are also

convinced that there is still room for further optimization and advancement in

several directions across the MapReduce framework, which is required

to realize the vision of providing large-scale data analysis as a commodity

for novice end-users. For example, energy efficiency in MapReduce is an

important problem that has not yet attracted sufficient attention from the research

community. The traditional challenge of debugging large-scale computations

on distributed systems has likewise not been given sufficient consideration by the MapReduce

research community. Regarding the expressive power of the

programming model, we feel that this area requires more investigation.

We also noticed that the oversimplicity of the MapReduce programming model

has raised some key challenges in dealing efficiently with complex data models (e.g., nested

models, XML and hierarchical models, RDF and graphs). This limitation

has created the need for a next generation of big data architectures and systems that

can provide the required scale and performance attributes for these domains. For

example, Google has created the Dremel system [182, 183], commercialized under

the name BigQuery [22], to support interactive analysis of nested data. Google

has also presented the Pregel system [180], with open source counterparts in the Apache Giraph and

Apache Hama projects, which uses a BSP-based programming model for efficient

and scalable processing of massive graphs on distributed clusters of commodity

machines. Recently, Twitter announced the release of the Storm [47] system as

a distributed and fault-tolerant platform for implementing continuous and real-time

processing applications over streamed data. We believe that more of these domain-

specific systems will be introduced in the future to form the new generation of big

data systems. Defining the right and most convenient programming abstractions

and declarative interfaces for these domain-specific big data systems is another

important research direction that will need to be investigated in depth.
