Sometimes people define availability as the portion of time that a service is running.
We think the definition should also include whether the application is serving requests
with good performance. There are many ways that a server can be running but not
really available. A common case is just after a MySQL server is restarted. It could take
many hours for a big server to warm up enough to serve queries with acceptable
response times, even if the server receives only a small portion of its normal traffic.
A related consideration is whether you'll lose any data, even if your application doesn't
go offline. If a server has a truly catastrophic failure, you might lose at least some data,
such as the last few transactions that were written to the (now lost) binary log and
didn't make it to a replica's relay log. Can you tolerate this? Most applications can; the
alternatives are usually expensive or complex, or impose some performance overhead. For
example, you can use synchronous replication, or place the binary log on a device that's
replicated by DRBD so you won't lose it even if the server fails completely. (You can
still lose power to the whole data center, though.)
A smart application architecture can often reduce your availability needs, at least for
part of the system, and thus make high availability easier to achieve. Separating critical
and noncritical parts of your application can save you a lot of work and money, because
it's much easier to improve availability for a smaller system. You can identify
high-priority risks by calculating your “risk exposure,” which is the probability of failure
multiplied by the cost of failure. A simple spreadsheet of risks—with columns for the
probability, the cost, and the exposure—can help you prioritize your efforts.
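The risk-exposure calculation is simple enough to sketch in a few lines. The risks and figures below are hypothetical examples, not data from any real incident analysis:

```python
# Risk exposure = probability of failure x cost of failure.
# The risk names, probabilities, and costs are illustrative only.
risks = [
    # (risk, probability per year, cost in dollars)
    ("Replica drive failure", 0.10, 2_000),
    ("Master drive failure", 0.10, 50_000),
    ("Accidental DROP TABLE", 0.02, 100_000),
]

# Compute each risk's exposure and sort so the highest-priority
# risks (largest exposure) come first.
exposures = sorted(
    ((name, prob * cost) for name, prob, cost in risks),
    key=lambda row: row[1],
    reverse=True,
)

for name, exposure in exposures:
    print(f"{name}: ${exposure:,.0f}")
```

Note that a cheap, frequent failure can rank below a rare but catastrophic one: here the master drive failure tops the list even though it is no more likely than the replica failure.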
In the previous chapter we examined how to achieve better scalability by avoiding the
causes of poor scalability. We'll take a similar approach here, because we believe that
availability is best understood by studying its opposite: downtime. Let's begin by
discussing why downtime happens.
What Causes Downtime?
We've heard it said that the main cause of downtime in database servers is badly written
SQL queries, but is that really true? In 2009 we decided to analyze our database of
customer incidents and determine what really causes downtime, and how to prevent
it.¹ Although the results affirmed some of what we already believed, they contradicted
other beliefs, and we learned a lot.
We first categorized the downtime incidents by the way they manifested, rather than
by cause. Broadly speaking, what we call the “operating environment” was the leading
place that downtime appeared, with about 35% of incidents landing in this category.
The operating environment is the set of systems and resources that support the database
1. We wrote a lengthy white paper with the full analysis of our customers' downtime-causing incidents, and
followed it with another on how to prevent downtime, including detailed checklists of activities you can
schedule periodically. There wasn't room to include all the details in this book, but you can find both
white papers on Percona's website (http://www.percona.com).