Sometimes people define availability as the portion of time that a service is running.
We think the definition should also include whether the application is serving requests
with good performance. There are many ways that a server can be running but not
really available. A common case is just after a MySQL server is restarted. It could take
many hours for a big server to warm up enough to serve queries with acceptable
response times, even if the server receives only a small portion of its normal traffic.
A related consideration is whether you'll lose any data, even if your application doesn't
go offline. If a server has a truly catastrophic failure, you might lose at least some data,
such as the last few transactions that were written to the (now lost) binary log and
didn't make it to a replica's relay log. Can you tolerate this? Most applications can; the
alternatives are usually expensive or complex, or impose some performance overhead. For
example, you can use synchronous replication, or place the binary log on a device that's
replicated by DRBD so you won't lose it even if the server fails completely. (You can
still lose power to the whole data center, though.)
A smart application architecture can often reduce your availability needs, at least for
part of the system, and thus make high availability easier to achieve. Separating critical
and noncritical parts of your application can save you a lot of work and money, because
it's much easier to improve availability for a smaller system. You can identify
high-priority risks by calculating your “risk exposure,” which is the probability of failure
multiplied by the cost of failure. A simple spreadsheet of risks—with columns for the
probability, the cost, and the exposure—can help you prioritize your efforts.
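The risk-exposure calculation is simple enough to sketch in a few lines. The risks and figures below are hypothetical examples, not data from any real incident analysis:

```python
# Risk exposure = probability of failure x cost of failure.
# The risk names, probabilities, and costs are illustrative only.
risks = [
    # (risk, probability per year, cost in dollars)
    ("Replica drive failure", 0.10, 2_000),
    ("Master drive failure", 0.10, 50_000),
    ("Accidental DROP TABLE", 0.02, 100_000),
]

# Compute each risk's exposure and sort so the highest-priority
# risks (largest exposure) come first.
exposures = sorted(
    ((name, prob * cost) for name, prob, cost in risks),
    key=lambda row: row[1],
    reverse=True,
)

for name, exposure in exposures:
    print(f"{name}: ${exposure:,.0f}")
```

Note that a cheap, frequent failure can rank below a rare but catastrophic one: here the master drive failure tops the list even though it is no more likely than the replica failure.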
In the previous chapter we examined how to achieve better scalability by avoiding the
causes of poor scalability. We'll take a similar approach here, because we believe that
availability is best understood by studying its opposite: downtime. Let's begin by
discussing why downtime happens.
What Causes Downtime?
We've heard it said that the main cause of downtime in database servers is badly written
SQL queries, but is that really true? In 2009 we decided to analyze our database of
customer incidents and determine what really causes downtime, and how to prevent
it.¹ Although the results affirmed some of what we already believed, they contradicted
other beliefs, and we learned a lot.
We first categorized the downtime incidents by the way they manifested, rather than
by cause. Broadly speaking, what we call the “operating environment” was the leading
place that downtime appeared, with about 35% of incidents landing in this category.
The operating environment is the set of systems and resources that support the database
1. We wrote a lengthy white paper with the full analysis of our customers' downtime-causing incidents, and
followed it with another on how to prevent downtime, including detailed checklists of activities you can
schedule periodically. There wasn't room to include all the details in this book, but you can find both
white papers on Percona's website (http://www.percona.com).