Secondary NameNode into its own machine, the spec of which should be comparable
to the NameNode. But, before going into how to set up a separate server as a
Secondary NameNode, I should explain what the Secondary NameNode does and
doesn't do, and in turn some of NameNode's underlying mechanics.
Due to its unfortunate naming, the Secondary NameNode (SNN) is sometimes
confused with a failover backup for NameNode. It most certainly is not. The SNN
only serves to periodically clean up and tighten the filesystem's state information in
NameNode, helping NameNode become more efficient. NameNode manages the
filesystem's state information using two files, FsImage and EditLog. The file FsImage is
a snapshot of the filesystem at some checkpoint, and EditLog records each incremental
change (delta) to the filesystem after that checkpoint. These two files can completely
determine the current state of the filesystem. When you initialize NameNode, it merges
these two files to create a new snapshot. At the end of NameNode's initialization,
FsImage will contain the new snapshot and EditLog will be empty. Afterward, any
operation that changes the state of HDFS is appended to EditLog, whereas FsImage will
remain unchanged. When you shut down NameNode and restart it, the consolidation
will take place again and make a new snapshot. Note that the two files are only for
retaining the filesystem's state information while NameNode is not running (either
intentionally shut down or due to system malfunction). NameNode keeps in memory
a constantly maintained copy of the filesystem's state information to quickly answer
queries about the filesystem.
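The startup merge described above can be pictured with a small toy model. This is only a sketch under simplifying assumptions: the filesystem state is reduced to a set of paths, and the operation names (create, delete) and the helper names apply_edit and merge are made up for illustration. They are not Hadoop's actual file formats or APIs.

```python
def apply_edit(state, edit):
    """Apply one logged change to an in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        state.add(path)
    elif op == "delete":
        state.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def merge(fsimage, editlog):
    """Replay every edit on top of the snapshot and return a new snapshot."""
    state = set(fsimage)
    for edit in editlog:
        apply_edit(state, edit)
    return frozenset(state)


# Snapshot taken at the last checkpoint, plus the changes logged since then.
fsimage = frozenset({"/user/alice/a.txt", "/user/bob/b.txt"})
editlog = [("create", "/user/alice/c.txt"), ("delete", "/user/bob/b.txt")]

# On startup: merge, keep the result as the new snapshot, start an empty log.
new_fsimage = merge(fsimage, editlog)
new_editlog = []
```

The cost of merge grows with the length of editlog, which is why a long-running, busy NameNode faces a slow restart if nothing consolidates the log in the meantime.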
For a busy cluster, the EditLog file will grow quite large, and the next restart of
NameNode will take a long time to merge EditLog into FsImage. For busy clusters,
it can also be a long time in between NameNode restarts, and you may want more
frequent snapshots for archival purposes. This is where SNN comes in. It consolidates
FsImage and EditLog into a new snapshot and leaves the NameNode alone to serve
live traffic. Therefore, it's more appropriate to think of the SNN as a checkpointing
server. Merging FsImage and EditLog is memory intensive, requiring an amount of
memory on the same order as normal NameNode operation. It's best for the SNN to
be on a separate server that is as powerful as the primary NameNode.
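To illustrate the checkpointing role, here is a rough sketch of a merge that happens off to the side while the NameNode keeps accepting changes. The ToyNameNode class and the checkpoint function are hypothetical stand-ins, the state is again just a set of paths, and the hand-off at the end is a simplification rather than a description of Hadoop's real protocol.

```python
class ToyNameNode:
    """Hypothetical stand-in for NameNode: a snapshot, an edit log, live state."""

    def __init__(self, fsimage):
        self.fsimage = frozenset(fsimage)
        self.editlog = []
        self.live = set(fsimage)  # in-memory copy that answers queries

    def create(self, path):
        self.live.add(path)
        self.editlog.append(("create", path))

    def delete(self, path):
        self.live.discard(path)
        self.editlog.append(("delete", path))


def checkpoint(fsimage, editlog):
    """Replay logged edits onto a copy of the snapshot (the SNN's job here)."""
    state = set(fsimage)
    for op, path in editlog:
        if op == "create":
            state.add(path)
        else:  # the toy model only knows "create" and "delete"
            state.discard(path)
    return frozenset(state)


nn = ToyNameNode({"/data/a"})
nn.create("/data/b")
nn.delete("/data/a")

# The checkpointing server works on copies, so the NameNode is left alone.
fetched_image, fetched_edits = nn.fsimage, list(nn.editlog)
nn.create("/data/c")  # traffic that arrives while the merge is running

new_image = checkpoint(fetched_image, fetched_edits)

# Install the new snapshot and keep only the edits it does not cover.
nn.fsimage = new_image
nn.editlog = nn.editlog[len(fetched_edits):]
```

Note that checkpoint builds a full copy of the namespace in memory, which mirrors the point above that the merge needs memory on the same order as the NameNode itself.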
To configure HDFS to use a separate server as the SNN, first list that server's host
name or IP address in the conf/masters file. Unfortunately, this file name is also
confusing. The masters in Hadoop (NameNode and JobTracker) are whichever
machines you run bin/start-dfs.sh and bin/start-mapred.sh on. What's listed in
conf/masters is the SNN, not any of the masters.
You should also modify the conf/hdfs-site.xml file on the SNN such that the
dfs.http.address property points to port 50070 of the NameNode's host address, like

<property>
  <name>dfs.http.address</name>
  <value>namenode.hadoop-host.com:50070</value>
</property>
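
As a quick sanity check, a short script could read this property back and confirm the host and port. The sketch below assumes the property sits inside the usual <configuration> root element and embeds the XML as a string for illustration. The get_property helper is hypothetical and not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Assumed contents of conf/hdfs-site.xml on the SNN (illustrative only).
HDFS_SITE = """<configuration>
  <property>
    <name>dfs.http.address</name>
    <value>namenode.hadoop-host.com:50070</value>
  </property>
</configuration>"""


def get_property(xml_text, name):
    """Return the value of the named property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


address = get_property(HDFS_SITE, "dfs.http.address")
host, port = address.rsplit(":", 1)  # split off the port at the last colon
```

Stripping whitespace from the value guards against a value that was wrapped across lines when the file was edited by hand.
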
You should set this property because the SNN retrieves FsImage and EditLog from the
NameNode by sending HTTP GET requests to the URLs: