DistCp
Hadoop Integration: Fully Integrated
If you have a Hadoop cluster and worry what would happen if the entire cluster became
unusable, you have a disaster recovery (DR) or continuity of operations (COOP) issue. There
are several strategies for dealing with this. One solution might be to load all data into both a
primary Hadoop cluster and a backup cluster located remotely from the primary cluster. This
is frequently called dual ingest. You would then have to run every job from the primary cluster
on the remote cluster as well to keep the result files in sync. While feasible, this is managerially
complex. Instead, you might want to consider using a built-in part of Apache Hadoop called DistCp.
Short for distributed copy, DistCp is the primary tool for moving data between Hadoop
clusters. You may want to use DistCp for other reasons as well, such as moving data from a
test or development cluster to a production cluster. Commercial Hadoop distributions have
tools to deal with DR and COOP; some are built on top of DistCp.
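As a rough sketch of how DistCp might fit into a DR workflow, a periodically scheduled job could use DistCp's -update flag, which copies only files that are missing or changed on the destination. The hostnames n1 (primary cluster's NameNode) and n2 (backup cluster's NameNode) and the /data path here are hypothetical:
$ hadoop distcp -update \
    hdfs://n1:8020/data \
    hdfs://n2:8020/data
Adding the -delete flag alongside -update would also remove files from the backup that no longer exist on the primary, keeping the two directory trees in closer sync.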
Tutorial Links
Likely as a result of the single-minded simplicity of DistCp, there aren't a whole lot of
dedicated tutorials about the technology. Readers who are interested in digging deeper are
invited to start with the official project page.
Example Code
Here's how you would copy a file named source-file from the directory source-dir on the
source system n1 to the directory dest-dir on the destination system n2, where n1 and n2
are the hostnames of the nodes on which the NameNode lives for the source and destination
clusters, respectively. If you were using this code snippet in a DR situation, source-dir and
dest-dir would be the same, as would source-file and dest-file:
$ hadoop distcp hdfs://n1:8020/source-dir/source-file \
hdfs://n2:8020/dest-dir/dest-file
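The port 8020 shown here is a common default for the NameNode's RPC port, but your clusters may be configured differently. Because DistCp is implemented as a MapReduce job, the copy work is spread across the cluster's worker nodes rather than funneled through a single machine.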