HDFS is a distributed file system designed to run on commodity hardware. It shares many similarities with existing distributed file systems, some of which have been mentioned already. It is highly fault-tolerant because of its configurable degree of data replication. Further, it fits well into a low-cost system design because it can be deployed on inexpensive hardware. HDFS can handle very large data sets (tens or hundreds of TB) and provides high-throughput access to application data.
Features
HDFS was built with the following goals in mind:
HDFS is tuned to support large files. A typical large file on HDFS may range from a few hundred GB to a few TB in size. Because a single file is distributed across the cluster in the form of multiple smaller blocks, HDFS must scale to hundreds of nodes and provide high aggregate data bandwidth.
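As an illustration, the short sketch below uses the standard org.apache.hadoop.fs.FileSystem API to inspect how a large file is broken into blocks. The file path and the presence of a configured HDFS client (core-site.xml/hdfs-site.xml on the classpath) are assumptions, not part of the original text.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();        // picks up the cluster configuration
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/large-dataset.bin"); // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            // A large file is stored as many fixed-size blocks spread over the cluster.
            long blockSize = status.getBlockSize();
            long blocks = (status.getLen() + blockSize - 1) / blockSize;
            System.out.println("File length: " + status.getLen()
                    + " bytes, block size: " + blockSize
                    + " bytes, stored as ~" + blocks + " blocks");
        }
    }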
HDFS acknowledges that failures are the norm rather than the exception. A typical production-level HDFS cluster may consist of hundreds of machines, each storing part of the data. Given the number of components, each with a nonzero probability of failure, it is likely that one or more HDFS components are nonfunctional at any given time. HDFS therefore implements automatic detection of failures and quick, automatic recovery from them to keep the overall infrastructure available.
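For example, the replication factor that underpins this fault tolerance can be read and adjusted per file through the FileSystem API. The following minimal sketch assumes a reachable cluster and a hypothetical file path.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/important-results.csv"); // hypothetical file

            // Each block of the file is kept on several DataNodes; if one fails,
            // the NameNode schedules re-replication from the surviving copies.
            short current = fs.getFileStatus(file).getReplication();
            System.out.println("Current replication factor: " + current);

            // Raise the replication factor for extra resilience (applied asynchronously).
            fs.setReplication(file, (short) 5);
        }
    }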
The system is designed for high-throughput batch processing rather than low-latency interactive use. However, variant implementations of HDFS exist (such as the Hadoop Online Prototype) that make it possible to consume processed data interactively, in near real time.
HDFS employs a simple coherency model in which appending writes to a file are dropped in favor of a write-once strategy. Thus, once a file is created, written to HDFS, and closed, it need not be changed. Such a model is extremely efficient for high-throughput access because it avoids the tedious work of serialization/deserialization, file-locking mechanisms, and repeated verification of a continuously growing file.
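In client code, the write-once pattern amounts to create, write, close, and then read back. The sketch below is only illustrative; the path is hypothetical and a configured HDFS client is assumed.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnceExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/events.log"); // hypothetical path

            // Write-once: create the file, stream data into it, then close it.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("event-1\nevent-2\n".getBytes(StandardCharsets.UTF_8));
            }
            // After close() the file is visible to readers and is not modified in place.

            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }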
Moving the computation to the data is often considered to be more efficient than
moving the data it operates on. This is especially true for very large data sets. HDFS
provides interfaces for applications to migrate themselves closer to the data.
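One such interface is FileSystem.getFileBlockLocations, which reports the DataNodes hosting each block so that a scheduler can place work on or near those hosts. The sketch below assumes a hypothetical input file.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalityExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/large-dataset.bin")); // hypothetical

            // Block locations tell a scheduler which DataNodes hold each block,
            // so computation can be moved close to the data.
            BlockLocation[] locations =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation loc : locations) {
                System.out.println("offset " + loc.getOffset()
                        + " length " + loc.getLength()
                        + " hosts " + String.join(",", loc.getHosts()));
            }
        }
    }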
Since HDFS is implemented in the highly portable Java language, the system itself is easily ported from one platform to another.
HDFS has a highly scalable master/slave architecture. An HDFS cluster consists of a NameNode, a master node that manages the file system namespace and governs access to files by clients. The NameNode is also the repository for all HDFS metadata. User data does not flow through the NameNode; instead, DataNodes manage the storage attached to the nodes they run on. HDFS exposes a file system whose behavior is similar to that of a local file system. Underneath, a file is internally split into one or more blocks, and these blocks are stored across a set of DataNodes. The block placement algorithm ensures that the data is replicated across DataNodes according to the configured replication factor.
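To make this division of labor concrete, the following sketch lists a directory: the listing is a pure metadata operation answered by the NameNode, while any subsequent reads of file contents go directly to the DataNodes. The directory path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceListing {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Directory listings are metadata operations served by the NameNode;
            // no file data flows through it.
            for (FileStatus s : fs.listStatus(new Path("/data"))) { // hypothetical directory
                System.out.println((s.isDirectory() ? "dir  " : "file ")
                        + s.getPath() + "  replication=" + s.getReplication());
            }
        }
    }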