Managing Hadoop - Hadoop in Action


Secondary NameNode into its own machine, the spec of which should be comparable

to the NameNode. But, before going into how to set up a separate server as a

Secondary NameNode, I should explain what the Secondary NameNode does and

doesn't do, and in turn some of NameNode's underlying mechanics.

Due to its unfortunate naming, the Secondary NameNode (SNN) is sometimes

confused with a failover backup for NameNode. It most certainly is not. The SNN

only serves to periodically clean up and tighten the filesystem's state information in

NameNode, helping NameNode become more efficient. NameNode manages the

filesystem's state information using two files, FsImage and EditLog. The file FsImage is

a snapshot of the filesystem at some checkpoint, and EditLog records each incremental

change (delta) to the filesystem after that checkpoint. Together, these two files completely

determine the current state of the filesystem. When NameNode starts up, it merges

these two files to create a new snapshot. At the end of NameNode's initialization,

FsImage will contain the new snapshot and EditLog will be empty. Afterward any

operation that changes the state of HDFS is appended to EditLog, whereas FsImage will

remain unchanged. When you shut down NameNode and restart it, the consolidation

will take place again and make a new snapshot. Note that the two files are only for

retaining the filesystem's state information while NameNode is not running (either

intentionally shut down or due to system malfunction). NameNode keeps in memory

a constantly maintained copy of the filesystem's state information to quickly answer

queries about the filesystem.
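
The snapshot-plus-log mechanism described above can be sketched as a toy model. This is only an illustration under stated assumptions: the function names, the ("create"/"delete", path) record layout, and the use of a set of paths as the filesystem state are all hypothetical and bear no relation to Hadoop's actual on-disk formats.

```python
# Toy model of NameNode state: a snapshot (FsImage) plus a log of deltas (EditLog).
# Names and record layout are illustrative assumptions, not Hadoop's real formats.

def apply_edit(state, edit):
    """Apply one logged change to an in-memory set of paths."""
    op, path = edit
    if op == "create":
        state.add(path)
    elif op == "delete":
        state.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")

def start_namenode(fs_image, edit_log):
    """Merge the snapshot and the log, as happens when NameNode starts.

    Returns (new_fs_image, new_edit_log, in_memory_state).
    """
    state = set(fs_image)
    for edit in edit_log:
        apply_edit(state, edit)
    # The merged state becomes the new snapshot and the log starts out empty.
    return frozenset(state), [], state

fs_image = frozenset({"/data/a.txt"})
edit_log = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]

new_image, new_log, live_state = start_namenode(fs_image, edit_log)

# After startup, a change updates the in-memory state and is appended to the log;
# the snapshot itself stays unchanged.
change = ("create", "/data/c.txt")
apply_edit(live_state, change)
new_log.append(change)

print(sorted(new_image))   # ['/data/b.txt']
print(new_log)             # [('create', '/data/c.txt')]
print(sorted(live_state))  # ['/data/b.txt', '/data/c.txt']
```

Note that the snapshot and the log together still describe the live state: replaying new_log on top of new_image yields the same set as live_state.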

For a busy cluster, the EditLog file will grow quite large, and the next restart of

NameNode will take a long time to merge EditLog into FsImage. On busy clusters,

a long time can also pass between NameNode restarts, and you may want more

frequent snapshots for archival purposes. This is where SNN comes in. It consolidates

FsImage and EditLog into a new snapshot and leaves the NameNode alone to serve

live traffic. Therefore, it's more appropriate to think of the SNN as a checkpointing

server. Merging FsImage and EditLog is memory intensive, requiring an amount of

memory on the same order as normal NameNode operation. It's best for the SNN to

be on a separate server that is as powerful as the primary NameNode.
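
To see why periodic checkpointing shortens the next restart, here is a rough sketch that counts how many log entries would have to be replayed with and without checkpoints. Everything in it is an assumption made for illustration: the edit records, the interval of 300 edits, and the replay and checkpoint helpers are hypothetical and do not reflect how the SNN is actually scheduled or implemented.

```python
# Toy comparison of restart cost with and without periodic checkpoints.
# "Cost" here is just the number of EditLog entries left to replay at restart.
# All names and numbers are made up for illustration.

def replay(snapshot, edits):
    """Return the state obtained by applying edits on top of a snapshot."""
    state = set(snapshot)
    for op, path in edits:
        if op == "create":
            state.add(path)
        elif op == "delete":
            state.discard(path)
    return state

def checkpoint(fs_image, edit_log):
    """Merge the log into the snapshot and hand back an empty log."""
    return frozenset(replay(fs_image, edit_log)), []

edits = [("create", f"/logs/part-{i}") for i in range(1000)]

# Case A: no checkpoints, so every edit stays in the log until restart.
image_a, log_a = frozenset(), list(edits)

# Case B: a checkpoint after every 300 edits (an arbitrary interval).
image_b, log_b = frozenset(), []
for count, edit in enumerate(edits, start=1):
    log_b.append(edit)
    if count % 300 == 0:
        image_b, log_b = checkpoint(image_b, log_b)

print(len(log_a), len(log_b))  # 1000 100
print(replay(image_a, log_a) == replay(image_b, log_b))  # True
```

Both cases describe the same filesystem state; the checkpointed case simply has far fewer log entries to merge at restart. The sketch also hints at the memory cost: each checkpoint builds the full state in memory, which is why the merge needs resources comparable to the NameNode's.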

To configure HDFS to use a separate server as the SNN, first list that server's host

name or IP address in the conf/masters file. Unfortunately, this file name is also

confusing. The masters in Hadoop (NameNode and JobTracker) are whichever

machines you run bin/start-dfs.sh and bin/start-mapred.sh on. What's listed in

conf/masters is the SNN, not any of the masters.

You should also modify the conf/hdfs-site.xml file on the SNN so that the

dfs.http.address property points to port 50070 of the NameNode's host address, for example:

<property>

<name>dfs.http.address</name>

<value>namenode.hadoop-host.com:50070</value>

</property>
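
As a quick sanity check on such a setting, a short script can read the property back out of the configuration file. The sketch below is an assumption-laden example rather than part of Hadoop: the get_property and split_host_port helpers are hypothetical, and the XML is an inline sample that wraps the property in the usual configuration root element instead of reading a real conf/hdfs-site.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical hdfs-site.xml contents; the host name is the example from the text.
HDFS_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.http.address</name>
    <value>namenode.hadoop-host.com:50070</value>
  </property>
</configuration>
"""

def get_property(xml_text, name):
    """Return the value of the named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None

def split_host_port(address):
    """Split 'host:port' into (host, port as int)."""
    host, _, port = address.rpartition(":")
    return host, int(port)

address = get_property(HDFS_SITE, "dfs.http.address")
host, port = split_host_port(address)
print(host, port)  # namenode.hadoop-host.com 50070
```

Stripping whitespace from the value guards against stray spaces around the host and port, which are easy to introduce when editing the file by hand.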

You should set this property because the SNN retrieves FsImage and EditLog from the

NameNode by sending HTTP GET requests to the URLs:
