namenode's metadata is stored on a remote filesystem). However, as the cluster gets larger, there are good reasons to separate them.
The namenode has high memory requirements, as it holds file and block metadata for the
entire namespace in memory. The secondary namenode, although idle most of the time,
has a comparable memory footprint to the primary when it creates a checkpoint. (This is
explained in detail in The filesystem image and edit log.) For filesystems with a large
number of files, there may not be enough physical memory on one machine to run both
the primary and secondary namenode.
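If the default heap is too small for a large namespace, the namenode (and secondary namenode) heap can be raised in hadoop-env.sh. A minimal sketch, assuming Hadoop 2's environment variable names; the 4 GB figure is purely illustrative, not a sizing recommendation:

    # hadoop-env.sh: heap sizes here are illustrative, not recommendations.
    # Size the namenode heap to the number of files and blocks it must hold.
    export HADOOP_NAMENODE_OPTS="-Xmx4g $HADOOP_NAMENODE_OPTS"
    # The secondary namenode loads the same namespace image when it
    # checkpoints, so it needs a comparable heap.
    export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx4g $HADOOP_SECONDARYNAMENODE_OPTS"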
Aside from simple resource requirements, the main reason to run masters on separate machines is for high availability. Both HDFS and YARN support configurations where they can run masters in active-standby pairs. If the active master fails, then the standby, running on separate hardware, takes over with little or no interruption to the service. In the case of HDFS, the standby performs the checkpointing function of the secondary namenode (so you don't need to run a standby and a secondary namenode).
Configuring and running Hadoop HA is not covered here. Refer to the Hadoop website or vendor documentation for details.
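For orientation, though, an HDFS HA setup revolves around a handful of hdfs-site.xml properties. A minimal sketch, assuming a nameservice called mycluster with two namenodes, nn1 and nn2 (the nameservice, namenode IDs, and hostnames are all illustrative):

    <!-- hdfs-site.xml: HA sketch only; mycluster, nn1, nn2, and the
         hostnames are made-up example names. -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>master1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>master2.example.com:8020</value>
    </property>

A real deployment also needs a shared edits store (typically a quorum journal), fencing, and a failover controller, which is why the full procedure is left to the Hadoop documentation.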
Network Topology
A common Hadoop cluster architecture consists of a two-level network topology, as illustrated in Figure 10-1. Typically there are 30 to 40 servers per rack (only three are shown in the diagram), with a 10 Gb switch for the rack and an uplink to a core switch or router (at least 10 Gb or better). The salient point is that the aggregate bandwidth between nodes on the same rack is much greater than that between nodes on different racks.
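Hadoop exploits this property when placing block replicas and scheduling work, but only if it knows which rack each node is on. One way to tell it is a topology script named by the net.topology.script.file.name property in core-site.xml; the script is given node addresses as arguments and prints one rack path per node. A minimal sketch, in which the script path, subnets, and rack names are all illustrative:

    <!-- core-site.xml: path to the topology script (path is illustrative) -->
    <property>
      <name>net.topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology.sh</value>
    </property>

    #!/bin/bash
    # topology.sh: illustrative mapping from node address to rack path.
    # A real script would consult a site-specific host-to-rack table.
    for node in "$@"; do
      case $node in
        10.1.1.*) echo /rack1 ;;
        10.1.2.*) echo /rack2 ;;
        *)        echo /default-rack ;;
      esac
    done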