Database Reference
In-Depth Information
Parallel Copying with distcp
The HDFS access patterns that we have seen so far focus on single-threaded access. It's
possible to act on a collection of files — by specifying file globs, for example — but for ef-
ficient parallel processing of these files, you would have to write a program yourself. Ha-
doop comes with a useful program called distcp for copying data to and from Hadoop
filesystems in parallel.
One use for distcp is as an efficient replacement for hadoop fs -cp . For example, you
can copy one file to another with: [ 34 ]
% hadoop distcp file1 file2
You can also copy directories:
% hadoop distcp dir1 dir2
If dir2 does not exist, it will be created, and the contents of the dir1 directory will be
copied there. You can specify multiple source paths, and all will be copied to the destina-
tion.
If dir2 already exists, then dir1 will be copied under it, creating the directory structure dir2/
dir1 . If this isn't what you want, you can supply the -overwrite option to keep the same
directory structure and force files to be overwritten. You can also update only the files that
have changed using the -update option. This is best shown with an example. If we
changed a file in the dir1 subtree, we could synchronize the change with dir2 by running:
% hadoop distcp -update dir1 dir2
TIP
If you are unsure of the effect of a distcp operation, it is a good idea to try it out on a small test directory
tree first.
distcp is implemented as a MapReduce job where the work of copying is done by the maps
that run in parallel across the cluster. There are no reducers. Each file is copied by a single
map, and distcp tries to give each map approximately the same amount of data by bucket-
ing files into roughly equal allocations. By default, up to 20 maps are used, but this can be
changed by specifying the -m argument to distcp .
A very common use case for distcp is for transferring data between two HDFS clusters. For
example, the following creates a backup of the first cluster's /foo directory on the second:
Search WWH ::




Custom Search