The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.[25] Let's examine this statement in more detail:
Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.[26]
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
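To make the streaming pattern concrete, the following is a minimal sketch that reads an entire file from HDFS in one sequential pass using Hadoop's Java FileSystem API. The class name and the HDFS URI passed on the command line are illustrative assumptions, not something defined in this text.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical example: stream a whole HDFS file sequentially to stdout.
public class SequentialRead {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // assumed to be an hdfs:// path supplied by the user
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            // Open the file and copy it in one sequential pass: the
            // whole-dataset, high-throughput access HDFS is optimized for.
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

Reading the file start to finish like this plays to HDFS's strengths; seeking around inside the file for individual records is exactly the low-latency pattern described below that it handles less well.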
Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware. It's designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors)[27] for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
It is also worth examining the applications for which using HDFS does not work so well.
Although this may change in the future, these are areas where HDFS is not a good fit
today:
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range,
will not work well with HDFS. Remember, HDFS is optimized for delivering a high
throughput of data, and this may be at the expense of latency. HBase (see Chapter 20 ) is
currently a better choice for low-latency access.
Lots of small files