% hadoop distcp -update -delete -p hdfs://namenode1/foo \
    hdfs://namenode2/foo
The -delete flag causes distcp to delete any files or directories from the destination that
are not present in the source, and -p means that file status attributes like permissions,
block size, and replication are preserved. You can run distcp with no arguments to see
precise usage instructions.
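The -update and -delete flags can be combined or used separately depending on how closely the destination should mirror the source. As a sketch (the cluster names are illustrative, carried over from the example above):

```shell
# Incremental copy: only files that are missing or differ from the
# source are copied; extra files on the destination are left alone.
hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/foo

# Force a full overwrite of the destination copy of every file,
# regardless of whether it already matches the source.
hadoop distcp -overwrite hdfs://namenode1/foo hdfs://namenode2/foo
```

Without either flag, distcp skips files that already exist at the destination, so -update is the usual choice for keeping two copies in sync.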
If the two clusters are running incompatible versions of HDFS, then you can use the
webhdfs protocol to distcp between them:
% hadoop distcp webhdfs://namenode1:50070/foo \
    webhdfs://namenode2:50070/foo
Another variant is to use an HttpFS proxy as the distcp source or destination (again using
the webhdfs protocol), which has the advantage of being able to set firewall and bandwidth
controls (see HTTP).
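Such a copy might look like the following sketch, where the proxy hostname is illustrative and 14000 is the default port an HttpFS server listens on:

```shell
# Copy from the source cluster into the destination cluster via an
# HttpFS proxy; only the proxy's host and port need to be reachable
# through the firewall, not every datanode.
hadoop distcp hdfs://namenode1/foo webhdfs://httpfs-host:14000/foo
```

From distcp's point of view an HttpFS endpoint is addressed exactly like a webhdfs one; the difference is that all traffic flows through the proxy rather than directly to the datanodes.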
Keeping an HDFS Cluster Balanced
When copying data into HDFS, it's important to consider cluster balance. HDFS works
best when the file blocks are evenly spread across the cluster, so you want to ensure that
distcp doesn't disrupt this. For example, if you specified -m 1 , a single map would do the
copy, which — apart from being slow and not using the cluster resources efficiently —
would mean that the first replica of each block would reside on the node running the map
(until the disk filled up). The second and third replicas would be spread across the cluster,
but this one node would be unbalanced. By having more maps than nodes in the cluster,
this problem is avoided. For this reason, it's best to start by running distcp with the default
of 20 maps per node.
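The number of maps is controlled with the -m flag; a sketch, again with illustrative paths:

```shell
# Use 100 maps for the copy. With fewer maps than cluster nodes, the
# nodes running the maps accumulate a disproportionate share of first
# replicas, so keep -m comfortably above the node count.
hadoop distcp -m 100 hdfs://namenode1/foo hdfs://namenode2/foo
```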
However, it's not always possible to prevent a cluster from becoming unbalanced. Perhaps
you want to limit the number of maps so that some of the nodes can be used by other jobs.
In this case, you can use the balancer tool (see Balancer ) to subsequently even out the
block distribution across the cluster.
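A minimal invocation of the balancer looks like this:

```shell
# Move blocks between datanodes until no node's disk usage deviates
# from the cluster average by more than 10 percentage points (the
# default threshold); a smaller value gives a more even distribution
# but takes longer to achieve.
hdfs balancer -threshold 10
```

The balancer runs until the cluster is balanced to within the threshold, it can make no further progress, or it is interrupted.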