Chapter 3. The Hadoop Distributed Filesystem
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network based, all the complications of network programming kick in, making distributed filesystems more complex than regular disk filesystems. For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. (You may sometimes see references to “DFS” — informally or in older documentation or configurations — which is the same thing.) HDFS is Hadoop's flagship filesystem and is the focus of this chapter, but Hadoop actually has a general-purpose filesystem abstraction, so we'll see along the way how Hadoop integrates with other storage systems (such as the local filesystem and Amazon S3).
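To give a flavor of what that abstraction looks like in practice, here is a rough sketch in Java of reading a file through Hadoop's generic FileSystem API; the class name FileSystemCat and the command-line URI are illustrative placeholders, not something defined in the text above.

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Reads a file through Hadoop's filesystem abstraction and copies it to stdout.
    // The scheme of the URI selects the concrete filesystem implementation:
    // hdfs:// for HDFS, file:// for the local filesystem, or an S3 connector
    // scheme (such as s3a://) for Amazon S3.
    public class FileSystemCat {
      public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://namenode/path/to/file or file:///tmp/data.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
          in = fs.open(new Path(uri));
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }

Because the program is written against the FileSystem interface rather than HDFS directly, the same code works unchanged against any storage system for which a Hadoop filesystem implementation exists.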