WARNING
Do not make the mistake of thinking that HDFS replication is a substitute for making backups. Bugs in HDFS can cause replicas to be lost, and so can hardware failures. Although Hadoop is expressly designed so that hardware failure is very unlikely to result in data loss, the possibility can never be completely ruled out, particularly when combined with software bugs or human error.
When it comes to backups, think of HDFS in the same way as you would RAID. Although the data will survive the loss of an individual RAID disk, it may not survive if the RAID controller fails or is buggy (perhaps overwriting some data), or the entire array is damaged.
It's common to have a policy for user directories in HDFS. For example, they may have
space quotas and be backed up nightly. Whatever the policy, make sure your users know
what it is, so they know what to expect.
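As a sketch of such a policy, an administrator can set a space quota on a user directory with `hdfs dfsadmin` and let users check their usage (the path and limit below are illustrative, and the commands need a running cluster):

```shell
# Limit /user/alice to 1 TB of raw storage; note that a space quota
# counts replicated block size, not logical file size.
hdfs dfsadmin -setSpaceQuota 1t /user/alice

# Report the quota and current usage for the directory, human-readable.
hdfs dfs -count -q -h /user/alice
```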
The distcp tool is ideal for making backups to other HDFS clusters (preferably running on a different version of the software, to guard against loss due to bugs in HDFS) or other Hadoop filesystems (such as S3) because it can copy files in parallel. Alternatively, you can employ an entirely different storage system for backups, using one of the methods for exporting data from HDFS described in Hadoop Filesystems.
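For example, a backup run with distcp might look like the following (the hostnames, ports, bucket name, and paths are hypothetical):

```shell
# Copy /user from this cluster to a second HDFS cluster in parallel.
hadoop distcp hdfs://namenode1:8020/user hdfs://namenode2:8020/backups/user

# Or back up to S3 via the s3a filesystem; -update copies only files
# that are missing or have changed at the destination.
hadoop distcp -update /user s3a://my-backup-bucket/user
```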
HDFS allows administrators and users to take snapshots of the filesystem. A snapshot is a read-only copy of a filesystem subtree at a given point in time. Snapshots are very efficient since they do not copy data; they simply record each file's metadata and block list, which is sufficient to reconstruct the filesystem contents at the time the snapshot was taken.
Snapshots are not a replacement for data backups, but they are a useful tool for point-in-time data recovery for files that were mistakenly deleted by users. You might have a policy of taking periodic snapshots and keeping them for a specific period of time according to age. For example, you might keep hourly snapshots for the previous day and daily snapshots for the previous month.
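The snapshot workflow can be sketched as follows (the directory, snapshot name, and filename are illustrative; enabling snapshots requires administrator privileges):

```shell
# An administrator marks a directory as snapshottable.
hdfs dfsadmin -allowSnapshot /user/alice

# Take a named snapshot of the directory.
hdfs dfs -createSnapshot /user/alice snap-001

# A mistakenly deleted file can later be copied back out of the
# read-only .snapshot directory.
hdfs dfs -cp /user/alice/.snapshot/snap-001/report.txt /user/alice/
```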
Filesystem check (fsck)
It is advisable to run HDFS's fsck tool regularly (i.e., daily) on the whole filesystem to proactively look for missing or corrupt blocks. See Filesystem check (fsck).
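A whole-filesystem check might be invoked like this (run against a live cluster):

```shell
# Check the entire namespace, listing files and the blocks that make
# them up; the summary reports missing, corrupt, and under-replicated blocks.
hdfs fsck / -files -blocks
```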
Filesystem balancer
Run the balancer tool (see Balancer) regularly to keep the filesystem datanodes evenly balanced.
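A typical invocation, assuming a running cluster, is:

```shell
# Move blocks between datanodes until each node's disk utilization is
# within 10 percentage points of the cluster-wide average.
hdfs balancer -threshold 10
```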