The projected time that it will take to bring a failed system online is called the recovery time objective (RTO). The estimated amount of original data that may be lost while the failover is being executed is called the recovery point objective (RPO). When designing a recovery plan it is important to communicate clear RTO and RPO targets.
Ensure that the recovery objectives will meet business requirements. Be aware that systems that
provide a rapid failover time with little or no data loss are often extremely expensive. A good rule
of thumb is that the shorter the downtime and the lower the expected data loss, the more the system
will cost.
Let's look at an example failover system that has an RTO of 1 hour and an RPO of 10 minutes. This imaginary system is going to cost $10,000 and will require DBAs to bring up the remote system within one hour. If we enhance this example system to fail over automatically, we reduce our RTO to 10 minutes, and with an RPO of zero we should not lose any data. Unfortunately, this enhanced system will cost us half a million dollars.
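As a back-of-the-envelope illustration of checking a design against its objectives (the numbers and names below are our own, not from this example), a few lines of Python make the comparison explicit:

# Hypothetical sanity check of a design against its recovery objectives.
rto_minutes = 60                 # agreed recovery time objective
rpo_minutes = 10                 # agreed recovery point objective

measured_failover_minutes = 45   # from the last failover drill (assumed)
measured_replication_lag = 6     # worst-case replication lag, minutes (assumed)

assert measured_failover_minutes <= rto_minutes, "RTO objective missed"
assert measured_replication_lag <= rpo_minutes, "RPO objective missed"
print("Design meets the stated RTO and RPO.")

If either assertion fails, the design does not meet the stated objectives, and either the architecture or the business requirements must change.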
MEASURING PERFORMANCE
The single most important performance metric is latency. Latency is a measure of system health and
the availability of system resources. Latency is governed by queuing theory, a mathematical study of
lines, or queues. An important contribution to queuing theory, now known as Little's Law, was intro-
duced in a proof submitted by John D. C. Little ( http://or.journal.informs.org/content/
9/3/383) in 1961.
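In its standard form (our notation, not the original paper's), Little's Law is written L = λW: the long-run average number of requests in a system (L) equals the average arrival rate (λ) multiplied by the average time each request spends in the system (W).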
Put simply, Little's Law states that given a steady-state system, as capacity reaches maximum performance, response time approaches infinity. To understand the power of Little's Law, consider the typical grocery store. If the store only opens one cash register and ten people are waiting in line, then you are going to wait longer to pay than if the store opened five or ten cashiers.
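To see the math behind the checkout-line intuition, here is a minimal Python sketch using the classic M/M/1 queueing result, where the average response time is W = 1/(μ − λ) for service rate μ and arrival rate λ (the model is our illustration; the chapter itself does not prescribe it):

# Average response time in an M/M/1 queue: W = 1 / (mu - lam).
# mu = service rate (requests/sec), lam = arrival rate (requests/sec).
def mm1_response_time(lam: float, mu: float) -> float:
    """Average time a request spends in an M/M/1 system, in seconds."""
    if lam >= mu:
        return float("inf")  # past capacity, the queue grows without bound
    return 1.0 / (mu - lam)

mu = 100.0  # the system can service 100 requests per second
for lam in (10, 50, 90, 99, 99.9):
    w = mm1_response_time(lam, mu)
    print(f"arrival rate {lam:>5} req/s -> response time {w * 1000:8.1f} ms")

Running it shows response time growing gently at low utilization and exploding as the arrival rate approaches the service rate, exactly the behavior described above.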
Storage systems are directly analogous to a grocery store checkout line. Each component has a
certain performance maximum. Driving the system toward maximum performance will increase
latency. We have found that most users and application owners equate high latency with outright failure. For example, it won't matter that a payroll system is online if it can't process transactions fast enough to get everyone's paycheck out on time.
You test I/O performance using several tools that are described later. You test storage using a logarithmic scale, starting with one I/O, moving to two, then four, then eight, and finally peaking at 256 I/Os that are all sent to storage in parallel (see Figure 4-1). As it turns out, this test is a perfect demonstration of Little's Law, because it shows how storage latency responds as the queue of outstanding I/Os deepens. One way to script such a sweep is sketched below.
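The sketch below shows how a queue-depth sweep might be run with fio, a widely used open-source I/O benchmark. The flags and the clat_ns JSON field reflect fio 3.x on Linux and are assumptions about your environment, not tooling this chapter prescribes; the target device is hypothetical and must be a scratch disk:

# Sketch of a queue-depth sweep using fio (assumes fio 3.x with libaio).
import json
import subprocess

TARGET = "/dev/sdb"  # hypothetical test device -- use a scratch disk!

for iodepth in (1, 2, 4, 8, 16, 32, 64, 128, 256):
    result = subprocess.run(
        ["fio", "--name=sweep", f"--filename={TARGET}",
         "--rw=randread", "--bs=8k", "--direct=1",
         "--ioengine=libaio", "--runtime=30", "--time_based",
         f"--iodepth={iodepth}", "--output-format=json"],
        capture_output=True, text=True, check=True)
    stats = json.loads(result.stdout)
    # clat_ns holds completion latency in nanoseconds in fio 3.x output.
    mean_ms = stats["jobs"][0]["read"]["clat_ns"]["mean"] / 1e6
    print(f"iodepth {iodepth:>3}: mean latency {mean_ms:7.2f} ms")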
As you can see in Figure 4-1, the storage response time remains less than our goal of 10 milliseconds through eight outstanding I/Os. As we increase the workload to 16 outstanding I/Os, the latency increases to 20 milliseconds. We can determine from this test that our configuration is optimal when we issue between 8 and 16 I/Os. This is called the knee of the curve. The system is capable of a lot more work, but only at latencies beyond our tolerance.
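Given a set of (queue depth, latency) measurements like those in Figure 4-1, picking the knee against a latency goal takes only a few lines; the sample numbers below are illustrative, not taken from the figure:

# Find the deepest queue depth whose mean latency stays under a target.
TARGET_MS = 10.0

samples = [  # (outstanding I/Os, mean latency in ms) -- illustrative data
    (1, 0.9), (2, 1.1), (4, 2.0), (8, 4.5),
    (16, 20.0), (32, 45.0), (64, 95.0), (128, 210.0), (256, 430.0),
]

knee = max(depth for depth, lat_ms in samples if lat_ms <= TARGET_MS)
print(f"Deepest queue depth within the {TARGET_MS} ms goal: {knee}")
# -> 8, so the knee of the curve sits between 8 and 16 outstanding I/Os.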
 