Hadoop Distributed File System (HDFS)
License
Apache License, Version 2.0
Activity
High
Purpose
High-capacity, fault-tolerant, inexpensive storage of very large datasets
Official Page
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
Hadoop Integration
Fully Integrated
The Hadoop Distributed File System (HDFS) is the place in a Hadoop cluster where you store data. Built for data-intensive applications, HDFS is designed to run on clusters of inexpensive commodity servers. HDFS is optimized for high-performance, read-intensive operations and is resilient to failures in the cluster. It does not prevent failures, but it is unlikely to lose data, because by default HDFS makes multiple copies of each of its data blocks. Moreover, HDFS is a write once, read many (or WORM-ish) filesystem: once a file is created, the filesystem API only allows you to append to the file, not to overwrite it. As a result, HDFS is usually inappropriate for normal online transaction processing (OLTP) applications. Most uses of HDFS are sequential reads of large files. These files are broken into large blocks, usually 64 MB or larger, and these blocks are distributed among the nodes in the cluster.
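For example, a minimal sketch using Hadoop's Java FileSystem API shows the write-once, append-only style of access. The path /user/demo/events.log is made up for illustration, and the cluster is assumed to have appends enabled.

// Sketch: create an HDFS file once, then append to it later.
// Assumes core-site.xml/hdfs-site.xml are on the classpath and that
// the (hypothetical) path below is writable by the current user.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // handle to the configured HDFS
        Path file = new Path("/user/demo/events.log");       // hypothetical path

        // Create the file and write its initial contents.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first batch of records\n");
        }

        // Existing data cannot be updated in place; later writers may only append.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("more records appended later\n");
        }
    }
}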
HDFS is not a POSIX-compliant filesystem like those you would see on Linux, Mac OS X, and some Windows platforms (see the POSIX Wikipedia page for a brief explanation), and it is not managed by the OS kernels on the cluster's nodes. Blocks in HDFS are mapped to files in the host's underlying filesystem, often ext3 on Linux systems. HDFS does not assume that the underlying disks in the host are RAID protected, so by default three copies of each block are made and placed on different nodes in the cluster. This protects against data loss when nodes or disks fail, and it supports Hadoop's practice of processing data where it resides rather than moving it across the network.
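As a sketch of how replication looks from client code, the following assumes the same hypothetical path, prints the block size and replication factor recorded for it, and then asks for an extra replica. Defaults for these values come from dfs.blocksize and dfs.replication in hdfs-site.xml.

// Sketch: inspect a file's block size and replication factor, then
// raise the replication of a hypothetical file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log");      // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size:  " + status.getBlockSize());   // e.g., 64 MB or 128 MB
        System.out.println("replication: " + status.getReplication()); // typically 3

        // Ask for one additional copy of this file's blocks.
        fs.setReplication(file, (short) 4);
    }
}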
Although an explanation is beyond the scope of this topic, metadata about the files in HDFS is managed through a NameNode, the Hadoop equivalent of the Unix/Linux superblock.
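To see that metadata in action, a client can ask through the FileSystem API (which consults the NameNode) which hosts hold each block of a file. This is only a sketch, and the path is again hypothetical.

// Sketch: list which DataNodes hold each block of a file -- the
// metadata that lets Hadoop schedule work where the data resides.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/events.log")); // hypothetical path

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + " length " + block.getLength()
                + " hosts " + String.join(",", block.getHosts()));
        }
    }
}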