Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management


main aim of these systems is to improve performance through the parallelization of various operations such as loading data, building indices, and evaluating queries. These systems are usually designed to run on top of a shared-nothing architecture [120], where data may be stored in a distributed fashion and input/output speeds are improved by using multiple CPUs and disks in parallel. On the other hand, there are some key reasons that make MapReduce preferable to a parallel RDBMS in some scenarios [20]:

• Formatting and loading a huge amount of data into a parallel RDBMS in a timely manner is a challenging and time-consuming task.

• The input data records may not always follow the same schema. Developers often want the flexibility to add and drop attributes, and the interpretation of an input data record may also change over time.

• Large-scale data processing can be very time consuming, and therefore it is important to keep the analysis job going even in the event of failures. While most parallel RDBMSs have fault tolerance support, a query usually has to be restarted from scratch even if just one node in the cluster fails. In contrast, MapReduce deals with failures in a more graceful manner and can redo only the part of the computation that was lost due to the failure.
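The difference between the two recovery strategies in the last point can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the function names, the notion of a "split", and the way a failure is injected (the first attempt on selected splits fails) are all hypothetical and do not correspond to any Hadoop or RDBMS API. It simply counts how many task executions each strategy needs.

```python
def run_with_task_retry(splits, task, failing_first_attempt):
    """MapReduce-style recovery (simplified): rerun only the failed split."""
    results = {}
    executions = 0
    for split_id, split in enumerate(splits):
        attempt = 0
        while True:
            attempt += 1
            executions += 1
            if split_id in failing_first_attempt and attempt == 1:
                # Simulated node failure: this attempt's output is lost,
                # so only this split is scheduled again.
                continue
            results[split_id] = task(split)
            break
    return results, executions


def run_with_full_restart(splits, task, failing_first_attempt):
    """Restart-from-scratch recovery (simplified): any failure discards all work."""
    executions = 0
    already_failed = set()
    while True:
        results = {}
        restarted = False
        for split_id, split in enumerate(splits):
            executions += 1
            if split_id in failing_first_attempt and split_id not in already_failed:
                # Simulated node failure: abandon the partial results and
                # start the whole job again.
                already_failed.add(split_id)
                restarted = True
                break
            results[split_id] = task(split)
        if not restarted:
            return results, executions


if __name__ == "__main__":
    splits = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(run_with_task_retry(splits, sum, {2}))
    print(run_with_full_restart(splits, sum, {2}))
```

With four splits and one failure on the third split, both strategies return the same results, but the task-level retry needs five task executions while the full restart needs seven. The gap grows with the number of splits and with how late in the job the failure occurs.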

There has been a long debate on the comparison between the MapReduce framework and parallel database systems* [121]. Pavlo et al. [113] conducted a large-scale comparison between the Hadoop implementation of the MapReduce framework and parallel SQL database management systems in terms of performance and development complexity. The results showed that parallel database systems had a significant performance advantage over MapReduce in executing a variety of data-intensive analysis tasks. On the other hand, the Hadoop implementation was much easier and more straightforward to set up and use than the parallel database systems. MapReduce has also been shown to have superior performance in minimizing the amount of work that is lost when a hardware failure occurs. In addition, MapReduce (with its open-source implementations) represents a very cheap solution in comparison to the very expensive parallel DBMS solutions (the price of an installation of a parallel DBMS cluster usually runs to seven figures in U.S. dollars) [121].

The HadoopDB project† is a hybrid system that tries to combine the scalability advantages of MapReduce with the performance and efficiency advantages of parallel databases [3]. The basic idea behind HadoopDB is to connect multiple single-node database systems (PostgreSQL) using Hadoop as the task coordinator and network communication layer. Queries are expressed in SQL, but their execution is parallelized across nodes using the MapReduce framework; however, as much of the single-node query work as possible is pushed inside the corresponding node databases. Thus, HadoopDB tries to achieve fault tolerance and the ability to operate in heterogeneous environments by inheriting the scheduling and job-tracking implementation

* http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/.

† http://db.cs.yale.edu/hadoopdb/hadoopdb.html.
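The idea of pushing single-node query work into the node databases, as described for HadoopDB above, can be sketched in a few lines. This is an illustrative toy under loud assumptions, not HadoopDB's actual code or API: in-memory SQLite databases stand in for the PostgreSQL nodes, plain Python functions stand in for Hadoop's map and reduce phases, and the `sales` table and all function names are hypothetical.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one single-node database (SQLite here, standing in for PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_phase(node):
    """Push the per-node part of the query into the node's own database.

    The local engine does the scan and partial aggregation, so only
    small partial results leave the node.
    """
    return node.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()


def reduce_phase(partials):
    """Merge the partial aggregates produced by all nodes."""
    totals = defaultdict(int)
    for partial in partials:
        for region, subtotal in partial:
            totals[region] += subtotal
    return dict(totals)


def run_query(nodes):
    """Coordinator (hypothetical): run the map step per node, then reduce."""
    return reduce_phase(map_phase(node) for node in nodes)


if __name__ == "__main__":
    nodes = [
        make_node([("east", 10), ("west", 5)]),
        make_node([("east", 7), ("north", 3)]),
    ]
    print(run_query(nodes))
```

The design point this illustrates is that the coordinator only sees one row per region per node, rather than every raw record, because the grouping and summing happen inside each node's database.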
