Comparison with Other Systems
Hadoop isn't the first distributed system for data storage and analysis, but it has some
unique properties that set it apart from other systems that may seem similar. Here we look
at some of them.
Relational Database Management Systems
Why can't we use databases with lots of disks to do large-scale analysis? Why is Hadoop
needed?
The answer to these questions comes from another trend in disk drives: seek time is
improving more slowly than transfer rate. Seeking is the process of moving the disk's head to
a particular place on the disk to read or write data. It characterizes the latency of a disk
operation, whereas the transfer rate corresponds to a disk's bandwidth.
If the data access pattern is dominated by seeks, reading or writing large portions of the
dataset will take longer than streaming through it, because streaming operates at the
transfer rate. On the other hand, for updating a small proportion of records in a database, a
traditional B-Tree (the data structure used in relational databases, which is limited by the
rate at which it can perform seeks) works well. For updating the majority of a database, a
B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
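The trade-off between seek-bound updates and a streaming rebuild can be put into rough
numbers. The following is a minimal back-of-the-envelope sketch, not a model of any real
database or of Hadoop itself. The function names, the 10 ms seek time, the 100 MB/s transfer
rate, and the simplification that every updated record costs exactly one seek (no caching,
no batching of nearby updates) are all assumptions chosen for illustration.

```python
# Illustrative cost model only: all default figures below are assumptions,
# not measurements and not values taken from the text.

def seek_update_seconds(num_records, fraction_updated, seek_ms=10.0):
    """Time to update records in place, assuming one disk seek per updated record."""
    return num_records * fraction_updated * seek_ms / 1000.0


def streaming_rewrite_seconds(num_records, record_bytes, transfer_mb_per_s=100.0):
    """Time to stream the whole dataset: read it once and write it once."""
    total_bytes = num_records * record_bytes
    return 2 * total_bytes / (transfer_mb_per_s * 1_000_000)


def break_even_fraction(record_bytes, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of updated records above which the streaming rewrite is faster."""
    stream_seconds_per_record = 2 * record_bytes / (transfer_mb_per_s * 1_000_000)
    return stream_seconds_per_record / (seek_ms / 1000.0)


if __name__ == "__main__":
    records, size = 1_000_000_000, 100  # hypothetical dataset: 1e9 records of 100 bytes
    for fraction in (1e-6, 1e-4, 1e-2):
        in_place = seek_update_seconds(records, fraction)
        rewrite = streaming_rewrite_seconds(records, size)
        print(f"fraction={fraction:g}  in-place={in_place:.0f}s  rewrite={rewrite:.0f}s")
    print("break-even fraction:", break_even_fraction(size))
```

Under these assumed figures, in-place updates win only while a very small share of the
records changes. Once the updated share passes the break-even point, rewriting the whole
dataset sequentially becomes cheaper, which is the intuition behind the comparison above.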
In many ways, MapReduce can be seen as a complement to a Relational Database Management
System (RDBMS). (The differences between the two systems are shown in Table 1-1.)
MapReduce is a good fit for problems that need to analyze the whole dataset in a batch
fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates,
where the dataset has been indexed to deliver low-latency retrieval and update times of a
relatively small amount of data. MapReduce suits applications where the data is written once
and read many times, whereas a relational database is good for datasets that are continually
updated. [7]