Managing Hadoop - Hadoop in Action

Databases Reference

In-Depth Information

writing, this block placement policy is baked into the NameNode. A pluggable policy

is targeted for version 0.21. 6

Besides block placement, task placement

is also rack aware. A task is usually placed

on a node that has a copy of the block the task is assigned to process. When no such

node is available to take on the new task, the task is randomly assigned to a node on a

rack where a copy of the block is available somewhere on that rack. That is, when data

locality can't be enforced at a node level, Hadoop tries to enforce it at the rack level.

Failing that, a task would be randomly assigned to one of the remaining nodes.

At this point you may wonder how Hadoop knows which rack a node is at. It requires

you to tell it. It assumes a hierarchical network topology

for your Hadoop cluster,

structurally similar to figure 8.1. Each node has a rack name similar to a file path.

For example, the nodes H1, H2, and H3 in figure 8.1 all have a rack name of /D1/R1 .

Figure 8.1 shows a case where you have multiple datacenters (D1 and D2) each with

multiple racks (R1 to R4). In most cases you'll be dealing with multiple racks co-located

together. Your rack names will be in a flat namespace, such as /R1 and /R2 .

To help Hadoop know the location of each node, you have to provide an executable

script that can map IP addresses into rack names. This network topology script must reside

on the master node and its location is specified in the topology.script.file.name

property in core-site.xml . Hadoop will call this script with a set of IP addresses as

separate arguments. The script should print out (through STDOUT) the rack name

corresponding to each IP address in the same order, separated by whitespace. The

topology.script.number.args property controls the maximum number of IP

addresses Hadoop will ask for at any one time. It's convenient to simplify your script by

setting that value to 1. Here is an example a network topology script.

/

D1

D2

R1

R2

R3

R4

H1

H2

H3

H4

H5

H6

H7

H8

H9

H10 H11 H12

Figure 8.1 A cluster with a hierarchical network topology. This cluster spans

multiples datacenters (D1 and D2). Each datacenter has multiple racks (R), and

each rack has multiple machines.

6 See http://issues.apache.org/jira/browse/HDFS-385 for the description of this change.

Hadoop in Action

Search WWH ::

Custom Search

Home