Essentials of HDFS
HDFS is a distributed filesystem designed to run on a cluster of industry-standard hardware; its architecture has no specific need for high-end machines. HDFS is a highly fault-tolerant system and can handle failures of nodes in a cluster without loss of data. The primary goal behind the design of HDFS is to serve large data files efficiently, and it achieves this efficiency and high data-transfer throughput by enabling streaming access to the data in the filesystem.
The following are the important features of HDFS:
Fault tolerance: Many computers working together as a cluster are bound to have hardware failures. Failures such as disk crashes, network connectivity issues, and RAM faults can disrupt processing and cause major downtime. This could lead to data loss as well as slippage of critical SLAs. HDFS is designed to withstand such hardware failures by detecting faults and taking recovery actions as required.
The data in HDFS is split across the machines in the cluster as chunks of data called blocks. These blocks are replicated across multiple machines of the cluster for redundancy. So, even if a node/machine becomes completely unusable and shuts down, the processing can go on with the copy of the data present on the nodes where the data was replicated.
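The splitting and replication described above can be sketched as a small simulation. This is illustrative Python, not the HDFS API; the block size, replication factor, node names, and round-robin placement are simplified assumptions (real HDFS uses 128 MB blocks and rack-aware placement):

```python
# Illustrative simulation of HDFS-style block replication (not the real API).
BLOCK_SIZE = 4          # bytes per block here; real HDFS defaults to 128 MB
REPLICATION = 3         # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster nodes

def split_into_blocks(data: bytes, size: int) -> list:
    """Split a payload into fixed-size blocks, as HDFS does with files."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(len(blocks))
    }

data = b"hello hdfs block replication"
blocks = split_into_blocks(data, BLOCK_SIZE)
placement = place_blocks(blocks, NODES, REPLICATION)

# Simulate losing one node: every block still has at least one live replica,
# so the file can be reassembled in full from the surviving nodes.
failed = "node2"
recoverable = all(any(n != failed for n in replicas)
                  for replicas in placement.values())
reassembled = b"".join(blocks)
print(recoverable, reassembled == data)  # True True
```

With a replication factor of 3 on 4 nodes, any single node failure leaves at least two replicas of every block, which is exactly the redundancy the text describes.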
Streaming data: Streaming access enables data to be transferred as a steady, continuous stream. This means that if data from a file in HDFS needs to be processed, HDFS starts sending the data as it reads the file and does not wait for the entire file to be read. The client consuming the data starts processing it immediately, as it receives the stream from HDFS. This makes data processing very fast.
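The streaming idea above can be sketched with a plain Python generator (this is not the HDFS client; `stream_chunks` and the tiny chunk size are illustrative assumptions):

```python
# Minimal sketch of streaming access: the reader yields chunks as soon as
# they are read, so the consumer can start work before the whole file
# has been transferred.
import io

def stream_chunks(f, chunk_size=8):
    """Yield a file's contents chunk by chunk instead of reading it whole."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return
        yield chunk

source = io.BytesIO(b"streaming access to filesystem data")
processed = []
for chunk in stream_chunks(source):
    processed.append(chunk.upper())  # consumer processes each chunk on arrival

result = b"".join(processed)
print(result)
```

The consumer's loop body runs after each chunk arrives, which is the same overlap of transfer and processing that makes HDFS streaming reads efficient.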
Large data store: HDFS is used to store large volumes of data. HDFS functions best when the individual files stored are large, rather than a large number of small files. File sizes in most Hadoop clusters range from gigabytes to terabytes. The storage scales linearly as more nodes are added to the cluster.
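A back-of-the-envelope calculation shows how linear scaling interacts with replication; the node counts and disk sizes below are made-up illustrative numbers, not figures from the text:

```python
# Usable HDFS capacity grows linearly with node count, but raw storage
# is divided by the replication factor (3 by default).
def usable_capacity_tb(nodes: int, disk_per_node_tb: float,
                       replication: int = 3) -> float:
    """Rough usable capacity: raw storage divided by replication factor."""
    return nodes * disk_per_node_tb / replication

print(usable_capacity_tb(10, 12))  # 10 nodes x 12 TB raw -> 40.0 TB usable
print(usable_capacity_tb(20, 12))  # doubling the nodes doubles it -> 80.0
```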
Portable: HDFS is a highly portable system. Since it is built in Java, any machine or operating system that can run Java should be able to run HDFS. Even at the hardware layer, HDFS is flexible and runs on most commonly available hardware platforms; most production-level clusters are set up on commodity hardware.