MapReduce - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

NETWORK TOPOLOGY AND HADOOP

What does it mean for two nodes in a local network to be “close” to each other? In the context of high-

volume data processing, the limiting factor is the rate at which we can transfer data between nodes —

bandwidth is a scarce commodity. The idea is to use the bandwidth between two nodes as a measure of

distance.

Rather than measuring bandwidth between nodes, which can be difficult to do in practice (it requires a

quiet cluster, and the number of pairs of nodes in a cluster grows as the square of the number of nodes),

Hadoop takes a simple approach in which the network is represented as a tree and the distance between

two nodes is the sum of their distances to their closest common ancestor. Levels in the tree are not pre-

defined, but it is common to have levels that correspond to the data center, the rack, and the node that a

process is running on. The idea is that the bandwidth available for each of the following scenarios be-

comes progressively less:

▪ Processes on the same node

▪ Different nodes on the same rack

▪ Nodes on different racks in the same data center

▪ Nodes in different data centers [ 32 ]

For example, imagine a node n1 on rack r1 in data center d1 . This can be represented as /d1/r1/n1 . Using

this notation, here are the distances for the four scenarios:

▪ distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)

▪ distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)

▪ distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)

▪ distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)

This is illustrated schematically in Figure 3-3 . (Mathematically inclined readers will notice that this is an

example of a distance metric.)

Search WWH ::

Custom Search

Home