Understanding HDFS backups
Data volumes in Hadoop clusters range from terabytes to petabytes, so deciding which data to back up from such clusters is an important decision. A disaster recovery plan for Hadoop clusters needs to be formulated at the cluster planning stage itself. The organization needs to identify the datasets it wants to back up and plan backup storage requirements accordingly.
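One way to plan those storage requirements is to measure how much space each candidate dataset currently occupies in HDFS. The following Java sketch uses the Hadoop FileSystem API for this; the namenode address and the paths in datasetsToBackUp are assumptions and would be replaced with the datasets identified in the plan.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BackupSizeEstimate {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical datasets identified for backup during cluster planning.
            String[] datasetsToBackUp = {"/data/warehouse", "/data/raw/events"};

            // Assumed namenode address; replace with the cluster's own.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            long totalBytes = 0;
            for (String dataset : datasetsToBackUp) {
                // getContentSummary reports the logical size of the directory tree.
                ContentSummary summary = fs.getContentSummary(new Path(dataset));
                totalBytes += summary.getLength();
                System.out.printf("%s : %d bytes%n", dataset, summary.getLength());
            }
            System.out.printf("Estimated backup storage required: %d bytes%n", totalBytes);
            fs.close();
        }
    }

Multiplying the total by the retention period and the number of backup copies kept gives a rough sizing figure for the backup storage tier.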
Backup schedules also need to be considered when designing a backup solution. The larger the dataset that needs to be backed up, the more time-consuming the activity, so it is more efficient to perform backups during a window when there is the least activity on the cluster. This not only helps the backup commands run efficiently, but also ensures consistency of the datasets being backed up. Knowing the schedules of data ingestion into HDFS in advance helps you better plan and schedule backups for Hadoop clusters.
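As an illustration only, the following Java sketch triggers a backup task once a day at the start of an assumed 02:00 low-activity window; runBackup() is a hypothetical placeholder for whatever backup commands the plan actually uses.

    import java.time.Duration;
    import java.time.LocalTime;
    import java.time.ZonedDateTime;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class NightlyBackupScheduler {
        public static void main(String[] args) {
            // Assumed low-activity window starting at 02:00; adjust to the
            // cluster's own ingestion schedule.
            LocalTime windowStart = LocalTime.of(2, 0);

            ZonedDateTime now = ZonedDateTime.now();
            ZonedDateTime firstRun = now.with(windowStart);
            if (!firstRun.isAfter(now)) {
                firstRun = firstRun.plusDays(1);
            }
            long initialDelayMinutes = Duration.between(now, firstRun).toMinutes();

            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            // Run the backup task every 24 hours, starting at the window.
            scheduler.scheduleAtFixedRate(NightlyBackupScheduler::runBackup,
                    initialDelayMinutes, TimeUnit.DAYS.toMinutes(1), TimeUnit.MINUTES);
        }

        private static void runBackup() {
            // Placeholder for the real backup commands (DistCp job, metadata fetch, etc.).
            System.out.println("Starting backup at " + ZonedDateTime.now());
        }
    }

In practice, most clusters would drive such a schedule from an existing workflow or cron facility rather than a standalone JVM process.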
The following are some of the important data sources that need to be protected against data loss:

The namenode metadata: The namenode metadata contains the locations of all the files in HDFS.

The Hive metastore: The Hive metastore contains the metadata for all Hive tables and partitions.

HBase RegionServer data: This contains the information for all the HBase regions.

Application configuration files: These are the important configuration files required to configure Apache Hadoop, for example, core-site.xml, yarn-site.xml, and hdfs-site.xml (a backup sketch for these files follows this list).
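As a concrete illustration, the following Java sketch copies the Hadoop configuration directory (assumed here to be /etc/hadoop/conf on the local filesystem) into an HDFS backup path using the FileSystem API; the namenode address and paths are assumptions. In practice, the namenode metadata is typically fetched with hdfs dfsadmin -fetchImage, and large datasets are copied with tools such as DistCp.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class ConfigBackup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Local filesystem holding the configuration files (assumed path).
            FileSystem localFs = FileSystem.getLocal(conf);
            Path configDir = new Path("/etc/hadoop/conf");

            // HDFS location reserved for backups (assumed namenode address and path).
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            Path backupDir = new Path("/backups/config/" + System.currentTimeMillis());

            // Recursively copy core-site.xml, hdfs-site.xml, yarn-site.xml, and the
            // rest of the directory, without deleting the source.
            boolean ok = FileUtil.copy(localFs, configDir, hdfs, backupDir, false, conf);
            System.out.println(ok ? "Configuration backed up to " + backupDir
                                  : "Backup failed");
            hdfs.close();
        }
    }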
Data protection in Hadoop clusters is important because clusters are prone to data corruption, hardware failures, and accidental data deletion. In rare cases, a data center catastrophe could even lead to the loss of all data.