redundancy, you can stop using the failed piece and start using its redundant standby
instead. The combination of redundancy and failover can enable you to recover more
quickly, and as you know, reducing MTTR reduces downtime and improves
availability.
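To make the MTTR point concrete, here is a small arithmetic sketch. It assumes the common steady-state definition of availability as MTBF / (MTBF + MTTR). The function name and the numbers are hypothetical and chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, using MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: one failure per 1,000 hours of operation.
MTBF_HOURS = 1000.0

# Manual recovery that takes four hours versus a failover that takes six minutes.
slow_recovery = availability(MTBF_HOURS, mttr_hours=4.0)
fast_failover = availability(MTBF_HOURS, mttr_hours=0.1)

print(f"4 h recovery:   {slow_recovery:.5f}")
print(f"6 min failover: {fast_failover:.5f}")
```

With the failure rate held constant, shortening recovery is the only thing that changes, and availability rises accordingly.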
Before we continue, we should talk about a few terms. We use “failover” consistently;
some people use “fallback” as a synonym. Sometimes people also say “switchover” to
denote a switch that's planned instead of a response to a failure. Po-tay-toe,
po-tah-toe. We also use the term “failback” to indicate the reverse of failover. If you
have failback capability, failover can be a two-way process: when server A fails and
server B replaces it, you can repair server A and fail back to it.
Failover is good for more than just recovery from failures. You can also do planned
failovers to reduce downtime (improve availability) for events such as upgrades, schema
changes, application modifications, or scheduled maintenance.
You need to identify how fast failover needs to be, but you also need to know how
quickly you have to replace the failed component after a failover. Until you restore the
system's depleted standby capacity, you have less redundancy and you're exposed to
extra risk. Thus, having a standby doesn't eliminate the need for timely replacement
of failed components. How quickly can you build a new standby server, install its
operating system, and give it a fresh copy of your data? Do you have enough standby
machines? You might need more than one.
Failover comes in many flavors. We've already discussed several of them, because load
balancing and failover are similar in many ways, and the line between them is a bit
fuzzy. In general, we think a full failover solution, at a minimum, needs to be able to
monitor and automatically replace a component. This should ideally be transparent to
the application. Load balancing need not provide this capability.
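The "monitor and automatically replace" requirement can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not how Pacemaker or any real tool works: the class name, the health-check callable, and the consecutive-failure threshold are all hypothetical.

```python
from typing import Callable, List


class FailoverMonitor:
    """Hypothetical monitor: promotes a standby after repeated failed checks."""

    def __init__(self, active: str, standbys: List[str],
                 is_healthy: Callable[[str], bool], max_failures: int = 3):
        self.active = active
        self.standbys = list(standbys)
        self.is_healthy = is_healthy      # assumed health probe, supplied by the caller
        self.max_failures = max_failures  # consecutive failures tolerated before failover
        self.failures = 0

    def check_once(self) -> str:
        """Run one health check and return the name of the active component."""
        if self.is_healthy(self.active):
            self.failures = 0
            return self.active
        self.failures += 1
        if self.failures >= self.max_failures and self.standbys:
            # Replace the failed component with the next standby.
            self.active = self.standbys.pop(0)
            self.failures = 0
        return self.active


# Usage: db1 is down, so the second failed check promotes db2.
down = {"db1"}
monitor = FailoverMonitor("db1", ["db2"], lambda name: name not in down,
                          max_failures=2)
first = monitor.check_once()
second = monitor.check_once()
```

After the promotion, the list of standbys is empty. That is the depleted standby capacity described earlier: the system keeps running, but nothing is left to fail over to until a replacement is built.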
In the Unix world, failover is often accomplished with the tools provided by the High
Availability Linux project (http://linux-ha.org), which run on many Unix-like operating
systems, not just Linux. The Linux-HA stack has become significantly more featureful
in the last few years. Today most people think of Pacemaker as the main component
in the stack. Pacemaker replaces the older heartbeat tool. Various other tools
accomplish IP takeover and load-balancing functionality. You can combine them with DRBD
and/or LVS.
The most important part of failover is failback. If you can't switch back and forth
between servers at will, failover is a dead end and only postpones downtime. This is
why we like symmetrical replication topologies, such as the dual-master configuration,
and we dislike ring replication with three or more co-masters. If the configuration is
symmetrical, failover and failback are the same operation in opposite directions. (It's
worth mentioning that DRBD has built-in failback capabilities.)
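The symmetry argument can be shown with a toy model. This sketch assumes a two-server pair in which the servers differ only in their current role. The names are hypothetical, and it models nothing about replication itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical two-server pair: one active, one passive."""
    active: str
    passive: str

    def switch(self) -> "Pair":
        # In a symmetrical topology, failover and failback are this same swap.
        return Pair(active=self.passive, passive=self.active)


pair = Pair(active="A", passive="B")
after_failover = pair.switch()            # B takes over from A
after_failback = after_failover.switch()  # A takes over again
```

Applying the swap twice returns the original configuration. A ring of three or more co-masters has no single inverse operation like this, which is one way to see why failing back is harder there.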
In some applications, it's critical that failover and failback be as fast and atomic as
possible. Even when it's not critical, it's still a good idea not to rely on things that are
 