The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.[25] Let's examine this statement in more detail:
Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.[26]
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
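To make the streaming pattern concrete, the following is a minimal sketch that reads an entire file from HDFS in one sequential pass using Hadoop's Java FileSystem API. The class name and the HDFS URI passed on the command line are illustrative assumptions, not something defined in this text.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical example: stream a whole HDFS file sequentially to stdout.
public class SequentialRead {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // assumed to be an hdfs:// path supplied by the user
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            // Open the file and copy it in one sequential pass: the
            // whole-dataset, high-throughput access HDFS is optimized for.
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

Reading the file start to finish like this plays to HDFS's strengths; seeking around inside the file for individual records is exactly the low-latency pattern described below that it handles less well.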
Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware. It's designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors)[27] for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
It is also worth examining the applications for which using HDFS does not work so well.
Although this may change in the future, these are areas where HDFS is not a good fit
today:
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range,
will not work well with HDFS. Remember, HDFS is optimized for delivering a high
throughput of data, and this may be at the expense of latency. HBase (see Chapter 20 ) is
currently a better choice for low-latency access.
Lots of small files