Parallel Copying with distcp
The HDFS access patterns that we have seen so far focus on single-threaded access. It's possible to act on a collection of files (by specifying file globs, for example), but for efficient parallel processing of these files, you would have to write a program yourself. Hadoop comes with a useful program called distcp for copying data to and from Hadoop filesystems in parallel.
One use for distcp is as an efficient replacement for hadoop fs -cp. For example, you can copy one file to another with: [34]
% hadoop distcp file1 file2
You can also copy directories:
% hadoop distcp dir1 dir2
If dir2 does not exist, it will be created, and the contents of the dir1 directory will be copied there. You can specify multiple source paths, and all will be copied to the destination.
If dir2 already exists, then dir1 will be copied under it, creating the directory structure dir2/dir1. If this isn't what you want, you can supply the -overwrite option, which copies the contents of dir1 directly into dir2 (keeping the same directory structure) and forces existing files to be overwritten. You can also update only the files that have changed using the -update option. This is best shown with an example. If we changed a file in the dir1 subtree, we could synchronize the change with dir2 by running:
% hadoop distcp -update dir1 dir2
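The destination rules described above can be summarized in a few lines of code. The following Python sketch is only a model of the behavior the text describes, not distcp's own code. The function name target_path and its parameters are hypothetical, and sync=True stands in for passing either -update or -overwrite.

```python
import posixpath


def target_path(src_file, src_dir, dst_dir, dst_exists, sync=False):
    """Model where src_file (a file inside src_dir) lands under dst_dir.

    dst_exists: whether dst_dir is already present before the copy.
    sync: True models passing -update or -overwrite (an assumption of
          this sketch: both map the contents of src_dir onto dst_dir).
    """
    src_dir = posixpath.normpath(src_dir)
    rel = posixpath.relpath(src_file, src_dir)
    if dst_exists and not sync:
        # Existing destination: src_dir itself is nested under it.
        return posixpath.join(dst_dir, posixpath.basename(src_dir), rel)
    # New destination, or -update/-overwrite: contents map directly.
    return posixpath.join(dst_dir, rel)
```

Under these assumptions, a file dir1/a/b.txt ends up at dir2/dir1/a/b.txt when dir2 already exists and no option is given, and at dir2/a/b.txt in the other two cases.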
TIP
If you are unsure of the effect of a distcp operation, it is a good idea to try it out on a small test directory tree first.
distcp is implemented as a MapReduce job where the work of copying is done by the maps that run in parallel across the cluster. There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations. By default, up to 20 maps are used, but this can be changed by specifying the -m argument to distcp.
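To make the idea of bucketing files into roughly equal allocations concrete, here is a minimal Python sketch. It uses a generic greedy heuristic (largest file first, always into the least-loaded bucket), which is an assumption chosen for illustration; distcp's actual split logic may differ. The function name bucket_files and the sample sizes are hypothetical, and max_maps plays the role of the -m argument.

```python
import heapq


def bucket_files(file_sizes, max_maps=20):
    """Group whole files into at most max_maps buckets of similar size.

    file_sizes: dict mapping path -> size in bytes.
    Returns a list of (total_bytes, [paths]) tuples, one per bucket.
    Each file goes to exactly one bucket, mirroring "each file is
    copied by a single map".
    """
    num_maps = min(max_maps, len(file_sizes))
    if num_maps == 0:
        return []
    # Min-heap of (bytes assigned so far, bucket index).
    heap = [(0, i) for i in range(num_maps)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_maps)]
    totals = [0] * num_maps
    # Place the largest files first, each into the least-loaded bucket.
    by_size = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for path, size in by_size:
        total, i = heapq.heappop(heap)
        buckets[i].append(path)
        totals[i] = total + size
        heapq.heappush(heap, (totals[i], i))
    return list(zip(totals, buckets))


# Hypothetical file sizes, split across two maps.
sizes = {"a": 100, "b": 60, "c": 40, "d": 30, "e": 30}
result = bucket_files(sizes, max_maps=2)
```

Because a file is never split across buckets, one very large file still dominates the running time of its map, however many maps are requested.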
A very common use case for distcp is transferring data between two HDFS clusters. For example, the following creates a backup of the first cluster's /foo directory on the second: