Moving Data in HDFS
As your big data needs grow, it is not uncommon to create additional
Hadoop clusters. Additional clusters are also used to keep workloads
separate and to manage the single-point-of-failure concerns that arise from
having a single NameNode. But what happens if you need access to the same
data from multiple clusters? You can export the data, using the dfs -get
command to move it back to a local filesystem and the dfs -put command
to put it into the new cluster. However, this is likely to be slow, and it
requires a large amount of additional disk space during the copying process.
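As a rough sketch of that round trip (the /tmp/export staging directory is
an assumed name, and mybackupcluster stands in for the second cluster's
NameNode):

hdfs dfs -get /user/MSBigDataSolutions /tmp/export
hdfs dfs -put /tmp/export \
    hdfs://mybackupcluster:8020/user/MSBigDataSolutions

The get step needs enough local disk to hold the entire data set, which is
exactly the overhead the tool described next avoids.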
Fortunately, a tool in HDFS makes this easier: distcp (Distributed Copy).
distcp enables a distributed approach to copying large amounts of data.
It does this by leveraging MapReduce to distribute the copy process across
multiple nodes in the cluster. The files to be copied, along with any
related directories, are gathered into a list. The list is then partitioned
among the available nodes, and each node becomes responsible for copying its
assigned files.
You execute distcp with two arguments: the source directory and the
target directory. To reference a different cluster, you use a fully
qualified URI that names the NameNode:
hadoop distcp hdfs://mynamenode:8020/user/MSBigDataSolutions \
    hdfs://mybackupcluster:8020/user/MSBigDataSolutions
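distcp also accepts options that control the copy. Two commonly used ones
are -update, which copies only files that are missing or differ at the
target, and -m, which caps the number of simultaneous copy tasks. For
example (the map count of 20 is arbitrary):

hadoop distcp -update -m 20 \
    hdfs://mynamenode:8020/user/MSBigDataSolutions \
    hdfs://mybackupcluster:8020/user/MSBigDataSolutions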
distcp can also be used for copying data inside the same cluster. This is
useful if you need to copy a large amount of data for backup purposes.
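Because the source and target then share a NameNode, plain HDFS paths are
enough. A minimal sketch, assuming /backup/MSBigDataSolutions as an
illustrative target directory:

hadoop distcp /user/MSBigDataSolutions /backup/MSBigDataSolutions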