Replication - MongoDB in Action


FAILOVER AND RECOVERY

You saw a couple of examples of failover in the sample replica set. Here I summarize the rules of failover and provide some suggestions on handling recovery.

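A replica set will come online when all members specified in the configuration can communicate with each other. Each node is given one vote by default, and those votes are used to form a majority and elect a primary. This means that a replica set can be started with as few as two nodes (and votes). But the initial number of votes also decides what constitutes a majority in the event of a failure: a majority is counted against all configured votes, not just the votes of the nodes that remain reachable. In a two-node set, for example, the loss of either node leaves one vote out of two, which isn't a majority, so no primary can be elected.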
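The majority rule can be sketched in a few lines of JavaScript. This is a minimal illustration, assuming one vote per node; the function names majorityOf and canElectPrimary are hypothetical and aren't part of MongoDB.

```javascript
// Votes needed for a majority of the *configured* votes,
// regardless of how many members are currently reachable.
function majorityOf(totalVotes) {
  return Math.floor(totalVotes / 2) + 1;
}

// True if the members that can still see each other hold
// enough votes to elect a primary.
function canElectPrimary(totalVotes, reachableVotes) {
  return reachableVotes >= majorityOf(totalVotes);
}

console.log(canElectPrimary(2, 1)); // two-node set, one node down: false
console.log(canElectPrimary(3, 2)); // three-node set, one node down: true
```

The two calls show why three voting members are the practical minimum for automated failover: a two-node set can't form a majority once either node is lost.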

Let's assume that you've configured a replica set of three complete replicas (no arbiters) and thus have the recommended minimum for automated failover. If the primary fails, and the remaining secondaries can see each other, then a new primary can be elected. As for deciding which one, the secondary with the most up-to-date oplog (or higher priority) will be elected primary.
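As a rough illustration of that choice, the following sketch picks a candidate from the surviving members. It's a simplified model and not MongoDB's actual election protocol: the member fields (host, reachable, priority, optime), the function name pickCandidate, and the hosts other than arete:40001 are all assumptions, and ranking by oplog freshness first with priority as the tiebreaker is just one reading of the rule stated above.

```javascript
// Simplified model of choosing a new primary (NOT the real protocol).
// Each member is a plain object with hypothetical fields:
//   host      - "name:port" string
//   reachable - whether the member can be seen by the others
//   priority  - configured priority (assumed to default to 1)
//   optime    - number; larger means a more recent oplog entry
function pickCandidate(members) {
  const reachable = members.filter((m) => m.reachable);
  if (reachable.length === 0) return null;
  return reachable.reduce((best, m) => {
    // Assumption: the most up-to-date oplog wins...
    if (m.optime !== best.optime) return m.optime > best.optime ? m : best;
    // ...and priority breaks ties between equally fresh members.
    return m.priority > best.priority ? m : best;
  });
}

const members = [
  { host: "arete:40000", reachable: false, priority: 1, optime: 105 }, // failed primary
  { host: "arete:40001", reachable: true, priority: 1, optime: 104 },
  { host: "arete:40002", reachable: true, priority: 1, optime: 102 },
];

console.log(pickCandidate(members).host);
```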

Failure modes and recovery

Recovery is the process of restoring the replica set to its original state following a failure. There are two overarching failure categories to be handled. The first comprises what are called clean failures, where a given node's data files can still be assumed to be intact. One example of this is a network partition. If a node loses its connections to the rest of the set, then you need only wait for connectivity to be restored, and the partitioned node will resume as a set member. A similar situation occurs when a given node's mongod process is terminated for any reason but can be brought back online cleanly (see note 9). Again, once the process is restarted, it can rejoin the set.

The second type of failure comprises all categorical failures, where either a node's data files no longer exist or must be presumed corrupted. Unclean shutdowns of the mongod process without journaling enabled and hard drive crashes are both examples of this kind of failure. The only ways to recover a categorically failed node are to completely replace the data files via a resync or to restore from a recent backup. Let's look at both strategies in turn.
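The distinction between the two failure categories can be summarized as a small decision function. This is only a sketch of the rules described here and in the note on journaling; the event names and the function classifyFailure are hypothetical labels chosen for illustration, not anything MongoDB reports.

```javascript
// Classify a failure as "clean" (data files presumed intact) or
// "categorical" (data files missing or presumed corrupted).
// The event names below are illustrative labels only.
function classifyFailure(event, journaling) {
  switch (event) {
    case "network-partition": // node is fine, just unreachable
    case "clean-shutdown":    // mongod stopped cleanly
      return "clean";
    case "unclean-shutdown":  // recoverable only if journaling was on
      return journaling ? "clean" : "categorical";
    case "disk-crash":        // data files are gone
      return "categorical";
    default:
      throw new Error("unknown event: " + event);
  }
}

console.log(classifyFailure("unclean-shutdown", true));  // clean
console.log(classifyFailure("unclean-shutdown", false)); // categorical
```

A clean result means the node only needs to be restarted or reconnected; a categorical result means choosing between a resync and a restore from backup.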

To completely resync, start a mongod with an empty data directory on the failed node. As long as the host and port haven't changed, the new mongod will rejoin the replica set and then resync all the existing data. If either the host or port has changed, then after bringing the mongod back online, you'll also have to reconfigure the replica set. As an example, suppose the node at arete:40001 is rendered unrecoverable and you bring up a new node at foobar:40000. You can reconfigure the replica set by grabbing the configuration document, modifying the host for the second node, and then passing that to the rs.reconfig() method:

> use local

> config = db.system.replset.findOne()

{

"_id" : "myapp",

9. For instance, if MongoDB is shut down cleanly, then you know that the data files are okay. Alternatively, if running with journaling, the MongoDB instance should be recoverable regardless of how it's killed.
