DistCp
Hadoop Integration: Fully Integrated
If you have a Hadoop cluster and worry what would happen if the entire cluster became
unusable, you have a disaster recovery (DR) or continuity of operations (COOP) issue. There
are several strategies for dealing with this. One solution might be to load all data into both a
primary Hadoop cluster and a backup cluster located remotely from the primary cluster. This
is frequently called dual ingest. You would then have to run every job from the primary cluster
on the remote cluster as well to keep the result files in sync. While feasible, this is managerially
complex. Instead, you might want to consider using a built-in part of Apache Hadoop called DistCp.
Short for distributed copy, DistCp is the primary tool for moving data between Hadoop
clusters. You may want to use DistCp for other reasons as well, such as moving data from a
test or development cluster to a production cluster. Commercial Hadoop distributions have
tools to deal with DR and COOP; some are built on top of DistCp.
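As a rough sketch of how DistCp might fit into a DR workflow, a periodically scheduled job could use DistCp's -update flag, which copies only files that are missing or changed on the destination. The hostnames n1 (primary cluster's NameNode) and n2 (backup cluster's NameNode) and the /data path here are hypothetical:
$ hadoop distcp -update \
    hdfs://n1:8020/data \
    hdfs://n2:8020/data
Adding the -delete flag alongside -update would also remove files from the backup that no longer exist on the primary, keeping the two directory trees in closer sync.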
Tutorial Links
Likely as a result of the single-minded simplicity of DistCp, there aren't a whole lot of
dedicated tutorials about the technology. Readers who are interested in digging deeper are
invited to start with the official project page.
Example Code
Here's how you would copy a file named source-file from the directory source-dir on the
source system n1 to the directory dest-dir on the destination system n2, where n1 and n2
are the hostnames of the nodes on which the NameNode lives for the source and destination
clusters, respectively. If you were using this code snippet in a DR situation, source-dir and
dest-dir would be the same, as would source-file and dest-file:
$ hadoop distcp hdfs://n1:8020/source-dir/source-file \
hdfs://n2:8020/dest-dir/dest-file
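The port 8020 shown here is a common default for the NameNode's RPC port, but your clusters may be configured differently. Because DistCp is implemented as a MapReduce job, the copy work is spread across the cluster's worker nodes rather than funneled through a single machine.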