Understanding HDFS backups
Data volumes in Hadoop clusters range from terabytes to petabytes, so deciding which data to back up from such clusters is an important decision. A disaster recovery plan for Hadoop clusters needs to be formulated at the cluster planning stage itself. The organization needs to identify the datasets it wants to back up and plan backup storage requirements accordingly.
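One way to plan those storage requirements is to measure how much space each candidate dataset currently occupies in HDFS. The following Java sketch uses the Hadoop FileSystem API for this; the namenode address and the paths in datasetsToBackUp are assumptions and would be replaced with the datasets identified in the plan.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BackupSizeEstimate {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical datasets identified for backup during cluster planning.
            String[] datasetsToBackUp = {"/data/warehouse", "/data/raw/events"};

            // Assumed namenode address; replace with the cluster's own.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            long totalBytes = 0;
            for (String dataset : datasetsToBackUp) {
                // getContentSummary reports the logical size of the directory tree.
                ContentSummary summary = fs.getContentSummary(new Path(dataset));
                totalBytes += summary.getLength();
                System.out.printf("%s : %d bytes%n", dataset, summary.getLength());
            }
            System.out.printf("Estimated backup storage required: %d bytes%n", totalBytes);
            fs.close();
        }
    }

Multiplying the total by the retention period and the number of backup copies kept gives a rough sizing figure for the backup storage tier.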
Backup schedules also need to be considered when designing a backup solution. The larger the dataset that needs to be backed up, the more time-consuming the activity, so it is more efficient to perform backups during a window when there is the least activity on the cluster. This not only helps the backup commands run efficiently, but also ensures consistency of the datasets being backed up. Knowing the schedules of data ingestion into HDFS in advance helps you better plan and schedule backups for Hadoop clusters.
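As an illustration only, the following Java sketch triggers a backup task once a day at the start of an assumed 02:00 low-activity window; runBackup() is a hypothetical placeholder for whatever backup commands the plan actually uses.

    import java.time.Duration;
    import java.time.LocalTime;
    import java.time.ZonedDateTime;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class NightlyBackupScheduler {
        public static void main(String[] args) {
            // Assumed low-activity window starting at 02:00; adjust to the
            // cluster's own ingestion schedule.
            LocalTime windowStart = LocalTime.of(2, 0);

            ZonedDateTime now = ZonedDateTime.now();
            ZonedDateTime firstRun = now.with(windowStart);
            if (!firstRun.isAfter(now)) {
                firstRun = firstRun.plusDays(1);
            }
            long initialDelayMinutes = Duration.between(now, firstRun).toMinutes();

            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            // Run the backup task every 24 hours, starting at the window.
            scheduler.scheduleAtFixedRate(NightlyBackupScheduler::runBackup,
                    initialDelayMinutes, TimeUnit.DAYS.toMinutes(1), TimeUnit.MINUTES);
        }

        private static void runBackup() {
            // Placeholder for the real backup commands (DistCp job, metadata fetch, etc.).
            System.out.println("Starting backup at " + ZonedDateTime.now());
        }
    }

In practice, most clusters would drive such a schedule from an existing workflow or cron facility rather than a standalone JVM process.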
The following are some of the important data sources that need to be protected against data loss:

The namenode metadata: The namenode metadata contains the locations of all the files in HDFS.

The Hive metastore: The Hive metastore contains the metadata for all Hive tables and partitions.

HBase RegionServer data: This contains the information for all the HBase regions.

Application configuration files: These are the important configuration files required to configure Apache Hadoop, for example, core-site.xml, yarn-site.xml, and hdfs-site.xml (a backup sketch for these files follows this list).
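As a concrete illustration, the following Java sketch copies the Hadoop configuration directory (assumed here to be /etc/hadoop/conf on the local filesystem) into an HDFS backup path using the FileSystem API; the namenode address and paths are assumptions. In practice, the namenode metadata is typically fetched with hdfs dfsadmin -fetchImage, and large datasets are copied with tools such as DistCp.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class ConfigBackup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Local filesystem holding the configuration files (assumed path).
            FileSystem localFs = FileSystem.getLocal(conf);
            Path configDir = new Path("/etc/hadoop/conf");

            // HDFS location reserved for backups (assumed namenode address and path).
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            Path backupDir = new Path("/backups/config/" + System.currentTimeMillis());

            // Recursively copy core-site.xml, hdfs-site.xml, yarn-site.xml, and the
            // rest of the directory, without deleting the source.
            boolean ok = FileUtil.copy(localFs, configDir, hdfs, backupDir, false, conf);
            System.out.println(ok ? "Configuration backed up to " + backupDir
                                  : "Backup failed");
            hdfs.close();
        }
    }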
Data protection in Hadoop clusters is important because clusters are prone to data corruption, hardware failures, and accidental data deletion. In rare cases, a data center catastrophe could even lead to the loss of all data.