Database Reference
In-Depth Information
NETWORK TOPOLOGY AND HADOOP
What does it mean for two nodes in a local network to be “close” to each other? In the context of high-
volume data processing, the limiting factor is the rate at which we can transfer data between nodes —
bandwidth is a scarce commodity. The idea is to use the bandwidth between two nodes as a measure of
distance.
Rather than measuring bandwidth between nodes, which can be difficult to do in practice (it requires a
quiet cluster, and the number of pairs of nodes in a cluster grows as the square of the number of nodes),
Hadoop takes a simple approach in which the network is represented as a tree and the distance between
two nodes is the sum of their distances to their closest common ancestor. Levels in the tree are not pre-
defined, but it is common to have levels that correspond to the data center, the rack, and the node that a
process is running on. The idea is that the bandwidth available for each of the following scenarios be-
comes progressively less:
▪ Processes on the same node
▪ Different nodes on the same rack
▪ Nodes on different racks in the same data center
▪ Nodes in different data centers [ 32 ]
For example, imagine a node n1 on rack r1 in data center d1 . This can be represented as /d1/r1/n1 . Using
this notation, here are the distances for the four scenarios:
distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
This is illustrated schematically in Figure 3-3 . (Mathematically inclined readers will notice that this is an
example of a distance metric.)
Search WWH ::




Custom Search