HDFS Architecture
Traditionally, data has been centralized rather than spread out. That worked
well over the past few decades, as the capability to store ever-increasing
amounts of data on a single disk continued to grow. For example, in 1981,
you could purchase hard drives that stored around 20MB at a cost of
approximately $180 per MB. By 2007, you could get a drive that stored 1TB
at a cost of about $0.0004 per MB.
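The scale of that price drop is easy to check with a quick calculation using the per-MB figures above:

```python
# Approximate cost per MB of hard-drive storage (figures from the text above)
cost_1981 = 180.0    # dollars per MB in 1981
cost_2007 = 0.0004   # dollars per MB in 2007

# How many times cheaper storage became over that period
drop_factor = cost_1981 / cost_2007
print(f"Storage became roughly {drop_factor:,.0f}x cheaper per MB")
# → Storage became roughly 450,000x cheaper per MB
```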
Today, storage needs in big-data scenarios continue to outpace the capacity
of even the largest drives (4TB). One early solution to this problem was
simply to add more hard drives. If you wanted to store 1 petabyte (1,024TB)
of information, you would need 256 4TB hard drives. However, placing all
the hard drives in the same server introduced a single point of failure. Any
problem that affected the server could mean the drives weren't
accessible, and so the data on the drives could be neither read nor written.
The single computer could also introduce a performance bottleneck for
access to the hard drives.
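The drive count above follows directly from the capacities involved; a two-line calculation makes the arithmetic explicit:

```python
PB_IN_TB = 1024   # 1 petabyte = 1,024 terabytes
DRIVE_TB = 4      # capacity of one large commodity drive, in TB

# Number of 4TB drives needed to hold 1 petabyte
drives_needed = PB_IN_TB // DRIVE_TB
print(drives_needed)  # → 256
```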
HDFS was designed to solve this problem by supporting distribution of the
data storage across many nodes. Because the data is spread across multiple
nodes, no single computer becomes a bottleneck. By storing redundant
copies of the information (discussed in more detail in the section “Data
Replication”), a single point of failure is also removed.
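The idea can be illustrated with a toy placement function. This is only a sketch of the general principle (distinct nodes hold redundant copies of each block), not HDFS's actual rack-aware placement policy; the function and node names are hypothetical:

```python
import itertools

def place_replicas(block_id, nodes, replication=3):
    """Toy illustration of replica placement (NOT HDFS's real policy):
    pick `replication` distinct nodes for a block, so that the loss of
    any single node never makes the block unavailable."""
    # Start at a position in the node "ring" derived from the block ID,
    # then take the next `replication` consecutive nodes.
    start = hash(block_id) % len(nodes)
    ring = itertools.islice(itertools.cycle(nodes), start, start + replication)
    return list(ring)

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas("block-0001", nodes)
assert len(set(placement)) == 3  # three distinct nodes hold copies
```

Because every block lives on several machines, reads can be served by whichever copy is closest, and no single computer becomes a bottleneck or a point of failure.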
This redundancy also enables the use of commodity hardware. (Commodity
means nonspecialized, off-the-shelf components.) Special hardware or a
unique configuration is not needed for a computer to participate in an
HDFS cluster. Commodity hardware tends to be less expensive than more
specialized components and can be acquired from a wider variety of vendors.
Many of today's server-class computers include a number of features
designed to minimize downtime. This includes things like redundant power
supplies, multiple network interfaces, and hard drive controllers capable
of managing pools of hard drives in redundant array of independent
disks (RAID) setups. Thanks to the data redundancy inherent in HDFS,
the need for this level of hardware is minimized, allowing the use of
less-expensive computers.