Database Reference
In-Depth Information
main aim of these systems is to improve performance through the parallelization
of various operations such as loading data, building indices, and evaluating queries.
These systems are usually designed to run on top of a shared-nothing architecture
[120] where data may be stored in a distributed fashion and input/output speeds are
improved using multiple CPUs and disks in parallel. On the other hand, there are
some key reasons that make MapReduce a more preferable approach over a parallel
RDBMS in some scenarios such as [20]
Formatting and loading a huge amount of data into a parallel RDBMS in a
timely manner is a challenging and time-consuming task.
The input data records may not always follow the same schema. Developers
often want the flexibility to add and drop attributes and the interpretation of
an input data record may also change over time.
Large-scale data processing can be very time consuming, and therefore,
it is important to keep the analysis job going even in the event of failures.
While most parallel RDBMSs have fault tolerance support, a query usually
has to be restarted from scratch even if just one node in the cluster fails. In
contrast, MapReduce deals with failures in a more graceful manner and can
redo only the part of the computation that was lost due to the failure.
There has been a long debate on the comparison between the MapReduce frame-
work and parallel database systems* [121]. Pavlo et al. [113] have conducted a large-
scale comparison between the Hadoop implementation of MapReduce framework
and parallel SQL database management systems in terms of performance and devel-
opment complexity. The results of this comparison have shown that parallel data-
base systems displayed a significant performance advantage over MapReduce in
executing a variety of data-intensive analysis tasks. On the other hand, the Hadoop
implementation was very much easier and more straightforward to set up and use in
comparison to that of the parallel database systems. MapReduce have also shown
to have superior performance in minimizing the amount of work that is lost when a
hardware failure occurs. In addition, MapReduce (with its open-source implementa-
tions) represents a very cheap solution in comparison to the very financially expen-
sive parallel DBMS solutions (the price of an installation of a parallel DBMS cluster
usually consists of seven figures of U.S. dollars) [121].
The HadoopDB project is a hybrid system that tries to combine the scalability
advantages of MapReduce with the performance and efficiency advantages of paral-
lel databases [3]. The basic idea behind HadoopDB is to connect multiple single-node
database systems (Post-greSQL) using Hadoop as the task coordinator and network
communication layer. Queries are expressed in SQL but their execution are parallel-
ized across nodes using the MapReduce framework, however, as much of the single-
node query work as possible is pushed inside of the corresponding node databases.
Thus, HadoopDB tries to achieve fault tolerance and the ability to operate in hetero-
geneous environments by inheriting the scheduling and job-tracking implementation
* http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/.
http://db.cs.yale.edu/hadoopdb/hadoopdb.html.
Search WWH ::




Custom Search