Databases Reference
In-Depth Information
writing, this block placement policy is baked into the NameNode. A pluggable policy
is targeted for version 0.21. 6
Besides block placement, task placement
is also rack aware. A task is usually placed
on a node that has a copy of the block the task is assigned to process. When no such
node is available to take on the new task, the task is randomly assigned to a node on a
rack where a copy of the block is available somewhere on that rack. That is, when data
locality can't be enforced at a node level, Hadoop tries to enforce it at the rack level.
Failing that, a task would be randomly assigned to one of the remaining nodes.
At this point you may wonder how Hadoop knows which rack a node is at. It requires
you to tell it. It assumes a hierarchical network topology
for your Hadoop cluster,
structurally similar to figure 8.1. Each node has a rack name similar to a file path.
For example, the nodes H1, H2, and H3 in figure 8.1 all have a rack name of /D1/R1 .
Figure 8.1 shows a case where you have multiple datacenters (D1 and D2) each with
multiple racks (R1 to R4). In most cases you'll be dealing with multiple racks co-located
together. Your rack names will be in a flat namespace, such as /R1 and /R2 .
To help Hadoop know the location of each node, you have to provide an executable
script that can map IP addresses into rack names. This network topology script must reside
on the master node and its location is specified in the topology.script.file.name
property in core-site.xml . Hadoop will call this script with a set of IP addresses as
separate arguments. The script should print out (through STDOUT) the rack name
corresponding to each IP address in the same order, separated by whitespace. The
topology.script.number.args property controls the maximum number of IP
addresses Hadoop will ask for at any one time. It's convenient to simplify your script by
setting that value to 1. Here is an example a network topology script.
/
D1
D2
R1
R2
R3
R4
H1
H2
H3
H4
H5
H6
H7
H8
H9
H10 H11 H12
Figure 8.1 A cluster with a hierarchical network topology. This cluster spans
multiples datacenters (D1 and D2). Each datacenter has multiple racks (R), and
each rack has multiple machines.
6 See http://issues.apache.org/jira/browse/HDFS-385 for the description of this change.
 
Search WWH ::




Custom Search