The projected time that it will take to bring a failed system online is called the recovery time objective (RTO). The estimated amount of original data that may be lost while the failover is being executed is called the recovery point objective (RPO). When designing a recovery plan it is important to communicate clear RTO and RPO targets.
Ensure that the recovery objectives will meet business requirements. Be aware that systems that
provide a rapid failover time with little or no data loss are often extremely expensive. A good rule
of thumb is that the shorter the downtime and the lower the expected data loss, the more the system
will cost.
Let's look at an example failover system that has an RTO of 1 hour and an RPO of 10 minutes. This imaginary system is going to cost $10,000 and will require DBAs to bring up the remote system within one hour. If we enhance this example system to fail over automatically, we reduce our RTO to 10 minutes, and with an RPO of zero we should not lose any data. Unfortunately, this enhanced system will cost us half a million dollars.
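As a back-of-the-envelope illustration of checking a design against its objectives (the numbers and names below are our own, not from this example), a few lines of Python make the comparison explicit:

# Hypothetical sanity check of a design against its recovery objectives.
rto_minutes = 60                 # agreed recovery time objective
rpo_minutes = 10                 # agreed recovery point objective

measured_failover_minutes = 45   # from the last failover drill (assumed)
measured_replication_lag = 6     # worst-case replication lag, minutes (assumed)

assert measured_failover_minutes <= rto_minutes, "RTO objective missed"
assert measured_replication_lag <= rpo_minutes, "RPO objective missed"
print("Design meets the stated RTO and RPO.")

If either assertion fails, the design does not meet the stated objectives, and either the architecture or the business requirements must change.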
MEASURING PERFORMANCE
The single most important performance metric is latency. Latency is a measure of system health and
the availability of system resources. Latency is governed by queuing theory, a mathematical study of
lines, or queues. An important contribution to queuing theory, now known as Little's Law, was intro-
duced in a proof submitted by John D. C. Little ( http://or.journal.informs.org/content/
9/3/383) in 1961.
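In its standard form (our notation, not the original paper's), Little's Law is written L = λW: the long-run average number of requests in a system (L) equals the average arrival rate (λ) multiplied by the average time each request spends in the system (W).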
Put simply, Little's Law states that given a steady-state system, as capacity reaches maximum performance, response time approaches infinity. To understand the power of Little's Law, consider the typical grocery store. If the store only opens one cash register and ten people are waiting in line, then you are going to wait longer to pay than if the store opened five or ten cashiers.
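To see the math behind the checkout-line intuition, here is a minimal Python sketch using the classic M/M/1 queueing result, where the average response time is W = 1/(μ − λ) for service rate μ and arrival rate λ (the model is our illustration; the chapter itself does not prescribe it):

# Average response time in an M/M/1 queue: W = 1 / (mu - lam).
# mu = service rate (requests/sec), lam = arrival rate (requests/sec).
def mm1_response_time(lam: float, mu: float) -> float:
    """Average time a request spends in an M/M/1 system, in seconds."""
    if lam >= mu:
        return float("inf")  # past capacity, the queue grows without bound
    return 1.0 / (mu - lam)

mu = 100.0  # the system can service 100 requests per second
for lam in (10, 50, 90, 99, 99.9):
    w = mm1_response_time(lam, mu)
    print(f"arrival rate {lam:>5} req/s -> response time {w * 1000:8.1f} ms")

Running it shows response time growing gently at low utilization and exploding as the arrival rate approaches the service rate, exactly the behavior described above.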
Storage systems are directly analogous to a grocery store checkout line. Each component has a
certain performance maximum. Driving the system toward maximum performance will increase
latency. We have found that most users and application owners equate high latency with outright failure. For example, it won't matter that a payroll system is online if it can't process transactions fast enough to get everyone's paycheck out on time.
You test I/O performance using several tools that are described later. You test storage using a logarithmic scale, starting with one I/O, moving to two, then four, then eight, and finally peaking at 256 I/Os that are all sent to storage in parallel (see Figure 4-1). As it turns out, this test is a perfect demonstration of Little's Law, because it shows how storage latency responds as the queue of outstanding I/Os deepens. One way to script such a sweep is sketched below.
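The sketch below shows how a queue-depth sweep might be run with fio, a widely used open-source I/O benchmark. The flags and the clat_ns JSON field reflect fio 3.x on Linux and are assumptions about your environment, not tooling this chapter prescribes; the target device is hypothetical and must be a scratch disk:

# Sketch of a queue-depth sweep using fio (assumes fio 3.x with libaio).
import json
import subprocess

TARGET = "/dev/sdb"  # hypothetical test device -- use a scratch disk!

for iodepth in (1, 2, 4, 8, 16, 32, 64, 128, 256):
    result = subprocess.run(
        ["fio", "--name=sweep", f"--filename={TARGET}",
         "--rw=randread", "--bs=8k", "--direct=1",
         "--ioengine=libaio", "--runtime=30", "--time_based",
         f"--iodepth={iodepth}", "--output-format=json"],
        capture_output=True, text=True, check=True)
    stats = json.loads(result.stdout)
    # clat_ns holds completion latency in nanoseconds in fio 3.x output.
    mean_ms = stats["jobs"][0]["read"]["clat_ns"]["mean"] / 1e6
    print(f"iodepth {iodepth:>3}: mean latency {mean_ms:7.2f} ms")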
As you can see in Figure 4-1, the storage response time remains less than our goal of 10 milliseconds through eight outstanding I/Os. As we increase the workload to 16 outstanding I/Os, the latency increases to 20 milliseconds. We can determine from this test that our configuration is optimal when we issue between 8 and 16 I/Os. This is called the knee of the curve. The system is capable of a lot more work, but only at latencies beyond our tolerance.
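Given a set of (queue depth, latency) measurements like those in Figure 4-1, picking the knee against a latency goal takes only a few lines; the sample numbers below are illustrative, not taken from the figure:

# Find the deepest queue depth whose mean latency stays under a target.
TARGET_MS = 10.0

samples = [  # (outstanding I/Os, mean latency in ms) -- illustrative data
    (1, 0.9), (2, 1.1), (4, 2.0), (8, 4.5),
    (16, 20.0), (32, 45.0), (64, 95.0), (128, 210.0), (256, 430.0),
]

knee = max(depth for depth, lat_ms in samples if lat_ms <= TARGET_MS)
print(f"Deepest queue depth within the {TARGET_MS} ms goal: {knee}")
# -> 8, so the knee of the curve sits between 8 and 16 outstanding I/Os.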
 