HDFS is a distributed file system designed to run on commodity hardware. It shares many similarities with existing distributed file systems, some of which have been mentioned already. It is highly fault-tolerant because of its configurable degree of data replication. Further, it fits well into a low-cost system design because it can be deployed on inexpensive hardware. HDFS can handle very large data sets (tens or hundreds of TB) and provides high-throughput access to application data.
Features
HDFS was built with the following goals in mind:
HDFS is tuned to support large files. A typical large file on HDFS may range from a few hundred GB to a few TB in size. Because a single file is distributed across the cluster in the form of multiple smaller blocks, HDFS must scale to hundreds of nodes and provide high aggregate data bandwidth.
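As an illustration, the short sketch below uses the standard org.apache.hadoop.fs.FileSystem API to inspect how a large file is broken into blocks. The file path and the presence of a configured HDFS client (core-site.xml/hdfs-site.xml on the classpath) are assumptions, not part of the original text.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();        // picks up the cluster configuration
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/large-dataset.bin"); // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            // A large file is stored as many fixed-size blocks spread over the cluster.
            long blockSize = status.getBlockSize();
            long blocks = (status.getLen() + blockSize - 1) / blockSize;
            System.out.println("File length: " + status.getLen()
                    + " bytes, block size: " + blockSize
                    + " bytes, stored as ~" + blocks + " blocks");
        }
    }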
HDFS acknowledges that failures are the norm rather than the exception. A typical production-level HDFS cluster may consist of hundreds of machines, each storing part of the data. Given the number of components, each with a nonzero probability of failure, it is likely that one or more HDFS components are nonfunctional at any given time. HDFS therefore implements automatic detection of failures and quick, automatic recovery from them to keep the overall infrastructure available.
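For example, the replication factor that underpins this fault tolerance can be read and adjusted per file through the FileSystem API. The following minimal sketch assumes a reachable cluster and a hypothetical file path.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/important-results.csv"); // hypothetical file

            // Each block of the file is kept on several DataNodes; if one fails,
            // the NameNode schedules re-replication from the surviving copies.
            short current = fs.getFileStatus(file).getReplication();
            System.out.println("Current replication factor: " + current);

            // Raise the replication factor for extra resilience (applied asynchronously).
            fs.setReplication(file, (short) 5);
        }
    }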
The system is designed for high-throughput batch processing rather than low-latency interactive use. However, variant implementations of HDFS exist (such as the Hadoop Online Prototype) that make it possible to consume processed data interactively, in near real time.
HDFS employs a simple coherency model in which appending writes to a file are dropped in favor of a write-once strategy. Thus, once a file is created, written to HDFS, and closed, it need not be changed. Such a model is extremely efficient for high-throughput access because it avoids the tedious work of serialization/deserialization, file-locking mechanisms, and repeated verification of a continuously growing file.
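In client code, the write-once pattern amounts to create, write, close, and then read back. The sketch below is only illustrative; the path is hypothetical and a configured HDFS client is assumed.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnceExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/events.log"); // hypothetical path

            // Write-once: create the file, stream data into it, then close it.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("event-1\nevent-2\n".getBytes(StandardCharsets.UTF_8));
            }
            // After close() the file is visible to readers and is not modified in place.

            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }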
Moving the computation to the data is often considered to be more efficient than
moving the data it operates on. This is especially true for very large data sets. HDFS
provides interfaces for applications to migrate themselves closer to the data.
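One such interface is FileSystem.getFileBlockLocations, which reports the DataNodes hosting each block so that a scheduler can place work on or near those hosts. The sketch below assumes a hypothetical input file.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalityExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/large-dataset.bin")); // hypothetical

            // Block locations tell a scheduler which DataNodes hold each block,
            // so computation can be moved close to the data.
            BlockLocation[] locations =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation loc : locations) {
                System.out.println("offset " + loc.getOffset()
                        + " length " + loc.getLength()
                        + " hosts " + String.join(",", loc.getHosts()));
            }
        }
    }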
Since HDFS is implemented in the highly portable Java language, the system itself is easily ported from one platform to another.
HDFS has a highly scalable master/slave architecture. An HDFS cluster consists of a NameNode, a master node that manages the file system namespace and governs access to files by clients. The NameNode is also the repository for all HDFS metadata. User data does not flow through the NameNode; instead, DataNodes manage the storage attached to the nodes they run on. HDFS exposes a file system whose behavior is similar to that of a local file system. Underneath, a file is internally split into one or more blocks, and these blocks are stored across a set of DataNodes. The block placement algorithm ensures that the data is replicated across DataNodes according to the configured replication factor.
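To make this division of labor concrete, the following sketch lists a directory: the listing is a pure metadata operation answered by the NameNode, while any subsequent reads of file contents go directly to the DataNodes. The directory path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceListing {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Directory listings are metadata operations served by the NameNode;
            // no file data flows through it.
            for (FileStatus s : fs.listStatus(new Path("/data"))) { // hypothetical directory
                System.out.println((s.isDirectory() ? "dir  " : "file ")
                        + s.getPath() + "  replication=" + s.getReplication());
            }
        }
    }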