Is money being lost? If so, quantify the amount.
What is the visibility of the issue?
Are external customers affected?
Could any regulatory or compliance obligations be breached?
How serious are the consequences if the problem persists?
Management can also be enlisted to identify mitigating factors. Are any options available to run a
degraded service, such as manual systems that enable some operations to continue? Encourage
managers to generate ideas for a short-term tactical solution while the root cause is investigated and a
resolution implemented.
Managers might also be helpful in engaging third parties, initially to make contact and open a
dialog, and, in situations in which escalation is required, to engage the right resources to advance
a solution. Each of these factors can be used to help shape the solution.
Service-Level Agreements
A service-level agreement (SLA) is an agreement between IT and the business, or between an
outsourcer and an organization. The SLA should define availability and performance metrics for
key business applications. SLAs often include metrics for response and resolution times in the event
of an incident. These agreements are non-functional requirements and are useful for managing business
expectations in terms of application performance, availability, and response time in the event of an
incident.
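
To make an availability metric concrete, it can help to translate a percentage target into a downtime budget. The sketch below is illustrative only: the function name, the 30-day period, and the 99.9% target are assumptions chosen for the example and do not come from any particular SLA.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    availability_pct and period_days are hypothetical inputs; a real SLA
    defines its own target and measurement period.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


budget = allowed_downtime_minutes(99.9)
print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
```

Expressing the target as minutes per period usually makes it easier for the business to judge whether the agreed figure is realistic.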
Two terms commonly used in storage solution design can be borrowed and adapted to most other
areas of IT and business agreements: recovery point objective (RPO) and recovery time objective
(RTO). Both can be included within an SLA to govern the data loss and recovery period following
an incident.
RTO refers to the maximum amount of time a solution can be down before service must be restored. This
varies according to the type of failure. For example, in the event of a single server failure in a
failover cluster, the RTO could reasonably be 1-2 minutes; in the event of a total site loss, it might
reasonably be four hours. This RTO metric essentially governs how long IT has to restore service in
the event of various types of failures.
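
Because the RTO depends on the failure type, one way to record it is as a simple lookup that an incident review can check against. This is a minimal sketch: the dictionary keys and the function name are hypothetical, and the targets are the example figures quoted above (the upper bound of 1-2 minutes, and four hours), not recommendations.

```python
from datetime import timedelta

# Hypothetical RTO targets per failure type, using the example figures from the text.
RTO_TARGETS = {
    "single_server_failure": timedelta(minutes=2),
    "total_site_loss": timedelta(hours=4),
}


def rto_met(failure_type: str, outage: timedelta) -> bool:
    """Return True if service was restored within the agreed RTO."""
    return outage <= RTO_TARGETS[failure_type]


print(rto_met("single_server_failure", timedelta(seconds=90)))  # within the 2-minute target
print(rto_met("total_site_loss", timedelta(hours=5)))           # exceeds the 4-hour target
```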
RPO refers to how much data loss can be tolerated without impact to the business. In the SQL
Server world this commonly determines the frequency of transaction log backups. If, for example,
the RPO were five minutes, you would need to take log backups every five minutes to ensure a
maximum data loss of the same duration. Combining these facets of an agreement, it would be
fairly common for a DBA to agree to configure five-minute log backups, and log shipping to a
second location with an RPO of 15 minutes and an RTO of four hours. This would mean bringing
the disaster recovery location online within four hours and ensuring a maximum data loss duration
of 15 minutes. Agreeing to these objectives ahead of time with the business is an important part of
setting and managing expectations.
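
The gap between five-minute log backups and a 15-minute RPO at the second location can be reasoned about with some rough arithmetic. The sketch below assumes a simplified model of log shipping in which the worst-case data loss at the secondary is the backup interval plus the copy job interval plus the time a copy takes. The copy interval and copy duration used here are hypothetical values, and real environments should be measured rather than estimated this way.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         copy_interval: timedelta,
                         copy_duration: timedelta) -> timedelta:
    """Estimate worst-case data loss at the secondary site.

    Simplified assumption: a log backup may wait up to one copy interval
    before it is picked up, and it is only usable once the copy completes.
    """
    return backup_interval + copy_interval + copy_duration


rpo = timedelta(minutes=15)
exposure = worst_case_data_loss(
    backup_interval=timedelta(minutes=5),  # from the example in the text
    copy_interval=timedelta(minutes=5),    # hypothetical copy job schedule
    copy_duration=timedelta(minutes=2),    # hypothetical transfer time
)
print(f"Worst-case exposure: {exposure}, RPO met: {exposure <= rpo}")
```

Under these assumed values the exposure is 12 minutes, which leaves a small margin inside the 15-minute objective.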
 