Moving Data in HDFS
As your big data needs grow, it is not uncommon to create additional
Hadoop clusters. Additional clusters are also used to keep workloads
separate and to manage the single-point-of-failure concerns that arise from
having a single NameNode. But what happens if you need access to the same
data from multiple clusters? You can export the data, using the dfs -get
command to move it back to a local filesystem and the dfs -put command
to put it into the new cluster. However, this is likely to be slow, and it
requires a large amount of additional disk space during the copying process.
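As a rough sketch of that round trip (the /tmp/export staging directory is
an assumed name, and mybackupcluster stands in for the second cluster's
NameNode):

hdfs dfs -get /user/MSBigDataSolutions /tmp/export
hdfs dfs -put /tmp/export \
    hdfs://mybackupcluster:8020/user/MSBigDataSolutions

The get step needs enough local disk to hold the entire data set, which is
exactly the overhead the tool described next avoids.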
Fortunately, a tool in HDFS makes this easier: distcp (Distributed Copy).
distcp enables a distributed approach to copying large amounts of data.
It does this by leveraging MapReduce to distribute the copy process across
multiple nodes in the cluster. The files to be copied, along with any
related directories, are gathered into a list. The list is then partitioned
among the available nodes, and each node becomes responsible for copying its
assigned files.
You execute distcp with two arguments: the source directory and the
target directory. To reference a different cluster, you use a fully
qualified URI that names the NameNode:
hadoop distcp hdfs://mynamenode:8020/user/MSBigDataSolutions \
    hdfs://mybackupcluster:8020/user/MSBigDataSolutions
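distcp also accepts options that control the copy. Two commonly used ones
are -update, which copies only files that are missing or differ at the
target, and -m, which caps the number of simultaneous copy tasks. For
example (the map count of 20 is arbitrary):

hadoop distcp -update -m 20 \
    hdfs://mynamenode:8020/user/MSBigDataSolutions \
    hdfs://mybackupcluster:8020/user/MSBigDataSolutions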
distcp can also be used for copying data inside the same cluster. This is
useful if you need to copy a large amount of data for backup purposes.
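Because the source and target then share a NameNode, plain HDFS paths are
enough. A minimal sketch, assuming /backup/MSBigDataSolutions as an
illustrative target directory:

hadoop distcp /user/MSBigDataSolutions /backup/MSBigDataSolutions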