Hadoop Distributed File System (HDFS)
License
Apache License, Version 2.0
Activity
High
Purpose
High-capacity, fault-tolerant, inexpensive storage of very large datasets
Official Page
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
Hadoop Integration
Fully Integrated
The Hadoop Distributed File System (HDFS) is the place in a Hadoop cluster where you store data. Built for data-intensive applications, HDFS is designed to run on clusters of inexpensive commodity servers. HDFS is optimized for high-performance, read-intensive operations and is resilient to failures in the cluster. It does not prevent failures, but it is unlikely to lose data, because by default HDFS makes multiple copies of each of its data blocks. Moreover, HDFS is a write once, read many (or WORM-ish) filesystem: once a file is created, the filesystem API only allows you to append to the file, not to overwrite it. As a result, HDFS is usually inappropriate for normal online transaction processing (OLTP) applications. Most uses of HDFS are sequential reads of large files. These files are broken into large blocks, usually 64 MB or larger, and these blocks are distributed among the nodes in the cluster.
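For example, a minimal sketch using Hadoop's Java FileSystem API shows the write-once, append-only style of access. The path /user/demo/events.log is made up for illustration, and the cluster is assumed to have appends enabled.

// Sketch: create an HDFS file once, then append to it later.
// Assumes core-site.xml/hdfs-site.xml are on the classpath and that
// the (hypothetical) path below is writable by the current user.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // handle to the configured HDFS
        Path file = new Path("/user/demo/events.log");       // hypothetical path

        // Create the file and write its initial contents.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first batch of records\n");
        }

        // Existing data cannot be updated in place; later writers may only append.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("more records appended later\n");
        }
    }
}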
HDFS is not a POSIX-compliant filesystem like those you would see on Linux, Mac OS X, and some Windows platforms (see the POSIX Wikipedia page for a brief explanation), and it is not managed by the OS kernels on the cluster's nodes. Blocks in HDFS are mapped to files in the host's underlying filesystem, often ext3 on Linux systems. HDFS does not assume that the underlying disks in the host are RAID protected, so by default three copies of each block are made and placed on different nodes in the cluster. This protects against data loss when nodes or disks fail, and it supports Hadoop's practice of processing data where it resides rather than moving it across the network.
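As a sketch of how replication looks from client code, the following assumes the same hypothetical path, prints the block size and replication factor recorded for it, and then asks for an extra replica. Defaults for these values come from dfs.blocksize and dfs.replication in hdfs-site.xml.

// Sketch: inspect a file's block size and replication factor, then
// raise the replication of a hypothetical file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log");      // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size:  " + status.getBlockSize());   // e.g., 64 MB or 128 MB
        System.out.println("replication: " + status.getReplication()); // typically 3

        // Ask for one additional copy of this file's blocks.
        fs.setReplication(file, (short) 4);
    }
}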
Although an explanation is beyond the scope of this topic, metadata about the files in HDFS is managed through a NameNode, the Hadoop equivalent of the Unix/Linux superblock.
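To see that metadata in action, a client can ask through the FileSystem API (which consults the NameNode) which hosts hold each block of a file. This is only a sketch, and the path is again hypothetical.

// Sketch: list which DataNodes hold each block of a file -- the
// metadata that lets Hadoop schedule work where the data resides.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/events.log")); // hypothetical path

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + " length " + block.getLength()
                + " hosts " + String.join(",", block.getHosts()));
        }
    }
}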