Chapter 3. The Hadoop Distributed Filesystem
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network based, all the complications of network programming kick in, making distributed filesystems more complex than regular disk filesystems. For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. (You may sometimes see references to “DFS” — informally or in older documentation or configurations — which is the same thing.) HDFS is Hadoop's flagship filesystem and is the focus of this chapter, but Hadoop actually has a general-purpose filesystem abstraction, so we'll see along the way how Hadoop integrates with other storage systems (such as the local filesystem and Amazon S3).
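To give a flavor of what that abstraction looks like in practice, here is a rough sketch in Java of reading a file through Hadoop's generic FileSystem API; the class name FileSystemCat and the command-line URI are illustrative placeholders, not something defined in the text above.

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Reads a file through Hadoop's filesystem abstraction and copies it to stdout.
    // The scheme of the URI selects the concrete filesystem implementation:
    // hdfs:// for HDFS, file:// for the local filesystem, or an S3 connector
    // scheme (such as s3a://) for Amazon S3.
    public class FileSystemCat {
      public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://namenode/path/to/file or file:///tmp/data.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
          in = fs.open(new Path(uri));
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }

Because the program is written against the FileSystem interface rather than HDFS directly, the same code works unchanged against any storage system for which a Hadoop filesystem implementation exists.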