Meet Hadoop - Hadoop: The Definitive Guide

Comparison with Other Systems

Hadoop isn't the first distributed system for data storage and analysis, but it has some unique properties that set it apart from other systems that may seem similar. Here we look at some of them.

Relational Database Management Systems

Why can't we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed?

The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.

If the data access pattern is dominated by seeks, reading or writing large portions of the dataset will take longer than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
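
A back-of-the-envelope calculation makes this trade-off concrete. The sketch below is not from the text: it assumes illustrative figures (a 10 ms average seek, a 100 MB/s transfer rate, a 1 TB dataset of 100-byte records) and a deliberately simple cost model, with one seek per updated record on the seek-dominated path and one full read plus one full write for the streaming rewrite.

```python
# Rough cost model; every constant is an illustrative assumption, not a measurement.
SEEK_TIME_S = 0.010          # assumed average seek time: 10 ms
TRANSFER_RATE_BPS = 100e6    # assumed sustained transfer rate: 100 MB/s
DATASET_BYTES = 1e12         # assumed dataset size: 1 TB
RECORD_BYTES = 100           # assumed record size


def seek_update_seconds(fraction):
    """Time to update a fraction of records in place, paying one seek per record."""
    records = DATASET_BYTES / RECORD_BYTES
    updated = records * fraction
    per_record = SEEK_TIME_S + RECORD_BYTES / TRANSFER_RATE_BPS
    return updated * per_record


def streaming_rewrite_seconds():
    """Time to read the whole dataset once and write it back once."""
    return 2 * DATASET_BYTES / TRANSFER_RATE_BPS


def break_even_fraction():
    """Fraction of records above which the streaming rewrite is cheaper."""
    return streaming_rewrite_seconds() / seek_update_seconds(1.0)


if __name__ == "__main__":
    print(f"streaming rewrite: {streaming_rewrite_seconds() / 3600:.1f} h")
    for fraction in (0.0001, 0.01, 1.0):
        hours = seek_update_seconds(fraction) / 3600
        print(f"update {fraction:.2%} via seeks: {hours:.1f} h")
    print(f"break-even fraction: {break_even_fraction():.4%}")
```

Under these assumptions the streaming rewrite takes roughly five and a half hours, and seek-per-record updates stop being cheaper once more than about 0.02% of the records change. A real system would shift that threshold through caching, batching, and sorted updates, so the exact figure matters less than how small it is.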

In many ways, MapReduce can be seen as a complement to a Relational Database Management System (RDBMS). (The differences between the two systems are shown in Table 1-1.) MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated. [7]
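
The two access patterns can be contrasted with a toy, in-memory sketch. It is only an analogy and uses neither Hadoop nor a real database: a Python dict stands in for an RDBMS index, a plain list stands in for the raw dataset, and the record layout and the function names (`build_index`, `point_query`, `batch_totals`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical dataset: (record_id, category, value) tuples.
RECORDS = [
    (1, "a", 10),
    (2, "b", 25),
    (3, "a", 7),
    (4, "c", 40),
    (5, "b", 3),
]


def build_index(records):
    """Index records by id, standing in for an RDBMS index built ahead of time."""
    return {record_id: (category, value) for record_id, category, value in records}


def point_query(index, record_id):
    """Fetch one record through the index without touching the others."""
    return index.get(record_id)


def batch_totals(records):
    """Scan every record once and total the values per category."""
    groups = defaultdict(list)
    for _, category, value in records:   # map step: emit (category, value)
        groups[category].append(value)   # grouping step: collect values by key
    # reduce step: collapse each group to a single total
    return {category: sum(values) for category, values in groups.items()}


if __name__ == "__main__":
    index = build_index(RECORDS)
    print(point_query(index, 4))
    print(batch_totals(RECORDS))
```

The point query touches a single entry and depends on the index having been built in advance, while the batch job needs no index but reads every record, which is the kind of whole-dataset pass the text describes as a good fit for MapReduce.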
