% hadoop distcp -update -delete -p hdfs://namenode1/foo \
    hdfs://namenode2/foo
The -delete flag causes distcp to delete any files or directories from the destination that
are not present in the source, and -p means that file status attributes like permissions,
block size, and replication are preserved. You can run distcp with no arguments to see
precise usage instructions.
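The -update and -delete flags can be combined or used separately depending on how closely the destination should mirror the source. As a sketch (the cluster names are illustrative, carried over from the example above):

```shell
# Incremental copy: only files that are missing or differ from the
# source are copied; extra files on the destination are left alone.
hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/foo

# Force a full overwrite of the destination copy of every file,
# regardless of whether it already matches the source.
hadoop distcp -overwrite hdfs://namenode1/foo hdfs://namenode2/foo
```

Without either flag, distcp skips files that already exist at the destination, so -update is the usual choice for keeping two copies in sync.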
If the two clusters are running incompatible versions of HDFS, then you can use the
webhdfs protocol to distcp between them:
% hadoop distcp webhdfs://namenode1:50070/foo \
    webhdfs://namenode2:50070/foo
Another variant is to use an HttpFS proxy as the distcp source or destination (again using
the webhdfs protocol), which has the advantage of being able to set firewall and bandwidth
controls (see HTTP).
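Such a copy might look like the following sketch, where the proxy hostname is illustrative and 14000 is the default port an HttpFS server listens on:

```shell
# Copy from the source cluster into the destination cluster via an
# HttpFS proxy; only the proxy's host and port need to be reachable
# through the firewall, not every datanode.
hadoop distcp hdfs://namenode1/foo webhdfs://httpfs-host:14000/foo
```

From distcp's point of view an HttpFS endpoint is addressed exactly like a webhdfs one; the difference is that all traffic flows through the proxy rather than directly to the datanodes.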
Keeping an HDFS Cluster Balanced
When copying data into HDFS, it's important to consider cluster balance. HDFS works
best when the file blocks are evenly spread across the cluster, so you want to ensure that
distcp doesn't disrupt this. For example, if you specified -m 1 , a single map would do the
copy, which — apart from being slow and not using the cluster resources efficiently —
would mean that the first replica of each block would reside on the node running the map
(until the disk filled up). The second and third replicas would be spread across the cluster,
but this one node would be unbalanced. By having more maps than nodes in the cluster,
this problem is avoided. For this reason, it's best to start by running distcp with the default
of 20 maps per node.
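The number of maps is controlled with the -m flag; a sketch, again with illustrative paths:

```shell
# Use 100 maps for the copy. With fewer maps than cluster nodes, the
# nodes running the maps accumulate a disproportionate share of first
# replicas, so keep -m comfortably above the node count.
hadoop distcp -m 100 hdfs://namenode1/foo hdfs://namenode2/foo
```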
However, it's not always possible to prevent a cluster from becoming unbalanced. Perhaps
you want to limit the number of maps so that some of the nodes can be used by other jobs.
In this case, you can use the balancer tool (see Balancer ) to subsequently even out the
block distribution across the cluster.
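A minimal invocation of the balancer looks like this:

```shell
# Move blocks between datanodes until no node's disk usage deviates
# from the cluster average by more than 10 percentage points (the
# default threshold); a smaller value gives a more even distribution
# but takes longer to achieve.
hdfs balancer -threshold 10
```

The balancer runs until the cluster is balanced to within the threshold, it can make no further progress, or it is interrupted.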