namenode's metadata is stored on a remote filesystem). However, as the cluster gets larger, there are good reasons to separate them.
The namenode has high memory requirements, as it holds file and block metadata for the
entire namespace in memory. The secondary namenode, although idle most of the time,
has a comparable memory footprint to the primary when it creates a checkpoint. (This is
explained in detail in The filesystem image and edit log.) For filesystems with a large
number of files, there may not be enough physical memory on one machine to run both
the primary and secondary namenode.
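If the default heap is too small for a large namespace, the namenode (and secondary namenode) heap can be raised in hadoop-env.sh. A minimal sketch, assuming Hadoop 2's environment variable names; the 4 GB figure is purely illustrative, not a sizing recommendation:

    # hadoop-env.sh: heap sizes here are illustrative, not recommendations.
    # Size the namenode heap to the number of files and blocks it must hold.
    export HADOOP_NAMENODE_OPTS="-Xmx4g $HADOOP_NAMENODE_OPTS"
    # The secondary namenode loads the same namespace image when it
    # checkpoints, so it needs a comparable heap.
    export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx4g $HADOOP_SECONDARYNAMENODE_OPTS"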
Aside from simple resource requirements, the main reason to run masters on separate machines is for high availability. Both HDFS and YARN support configurations where they can run masters in active-standby pairs. If the active master fails, then the standby, running on separate hardware, takes over with little or no interruption to the service. In the case of HDFS, the standby performs the checkpointing function of the secondary namenode (so you don't need to run a standby and a secondary namenode).
Configuring and running Hadoop HA is not covered here. Refer to the Hadoop website or vendor documentation for details.
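For orientation, though, an HDFS HA setup revolves around a handful of hdfs-site.xml properties. A minimal sketch, assuming a nameservice called mycluster with two namenodes, nn1 and nn2 (the nameservice, namenode IDs, and hostnames are all illustrative):

    <!-- hdfs-site.xml: HA sketch only; mycluster, nn1, nn2, and the
         hostnames are made-up example names. -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>master1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>master2.example.com:8020</value>
    </property>

A real deployment also needs a shared edits store (typically a quorum journal), fencing, and a failover controller, which is why the full procedure is left to the Hadoop documentation.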
Network Topology
A common Hadoop cluster architecture consists of a two-level network topology, as illustrated in Figure 10-1. Typically there are 30 to 40 servers per rack (only three are shown in the diagram), with a 10 Gb switch for the rack and an uplink to a core switch or router (at least 10 Gb or better). The salient point is that the aggregate bandwidth between nodes on the same rack is much greater than that between nodes on different racks.
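Hadoop exploits this property when placing block replicas and scheduling work, but only if it knows which rack each node is on. One way to tell it is a topology script named by the net.topology.script.file.name property in core-site.xml; the script is given node addresses as arguments and prints one rack path per node. A minimal sketch, in which the script path, subnets, and rack names are all illustrative:

    <!-- core-site.xml: path to the topology script (path is illustrative) -->
    <property>
      <name>net.topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology.sh</value>
    </property>

    #!/bin/bash
    # topology.sh: illustrative mapping from node address to rack path.
    # A real script would consult a site-specific host-to-rack table.
    for node in "$@"; do
      case $node in
        10.1.1.*) echo /rack1 ;;
        10.1.2.*) echo /rack2 ;;
        *)        echo /default-rack ;;
      esac
    done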