What Is High Availability?
As shown in the previous example of the online store application, the business presses IT departments to provide
solutions that meet the availability requirements of business applications. Because the database is the centerpiece
of most business applications, database availability is the key to keeping those applications available.
In most IT organizations, Service Level Agreements (SLAs) define the application availability agreed between
the business and the IT organization. An SLA can state availability as a percentage, or as the maximum
downtime allowed per month or per year. For example, an SLA that specifies 99.999% availability allows less than
5.26 minutes of downtime per year. Sometimes an SLA also specifies a particular time window in which downtime
is allowed; for example, a back-end office application database can be down between midnight and 4 a.m. on the first
Saturday of each quarter for scheduled maintenance such as hardware and software upgrades.
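The arithmetic behind such percentages is easy to check. The following Python snippet is a minimal sketch, assuming
a 365-day year; the function name allowed_downtime_minutes is made up for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_minutes * (100 - availability_pct) / 100


# Print the yearly downtime budget for a few common SLA levels.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes per year")
```

Passing a different period_minutes (for example, 30 * 24 * 60) gives the budget for an SLA that is measured per
month instead of per year.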
Since most high availability (HA) solutions require additional hardware and/or software, they can be expensive.
Companies should determine their HA requirements based on the nature of their applications and their cost
structure. For example, some back-end office applications, such as a human resources application, may not need to
be online 24x7. For mission-critical business applications that do need to be highly available, the cost of
downtime should also be evaluated; for example, how much money would be lost during 1 hour of downtime. That
downtime cost can then be compared with the capital costs and operational expenses of designing and
implementing each level of availability solution. This comparison helps business managers and IT departments
arrive at realistic SLAs that meet real business needs, are affordable, and can be delivered by the IT team.
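The comparison described above can be sketched in a few lines of Python. This is an illustration only, not a
costing method from this chapter: the option names, availability levels, and dollar figures are hypothetical, and
the sketch assumes that capital costs have already been spread into a single annual solution cost.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def annual_downtime_cost(availability_pct, cost_per_hour):
    """Expected yearly loss from downtime at a given availability level."""
    downtime_hours = HOURS_PER_YEAR * (100 - availability_pct) / 100
    return downtime_hours * cost_per_hour


def cheapest_option(options, cost_per_hour):
    """Return the option with the lowest solution cost plus downtime cost.

    `options` is a list of (name, availability_pct, annual_solution_cost) tuples.
    """
    def total(option):
        _, availability_pct, solution_cost = option
        return solution_cost + annual_downtime_cost(availability_pct, cost_per_hour)

    return min(options, key=total)


# Hypothetical figures for illustration only.
options = [
    ("single server", 99.0, 10_000),
    ("failover cluster", 99.9, 60_000),
    ("cluster plus standby site", 99.99, 250_000),
]

for hourly_loss in (500, 20_000):
    name = cheapest_option(options, hourly_loss)[0]
    print(f"downtime costs ${hourly_loss}/hour -> {name}")
```

With these made-up numbers, the cheapest choice moves from the single server to the failover cluster as the hourly
cost of downtime rises, which is the kind of trade-off an SLA discussion needs to make explicit.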
Many business applications consist of multi-tier applications that run on multiple computers in a distributed
network. The availability of the business applications depends not only on the infrastructure that supports these
multi-tier applications, including the server hardware, storage, network, and OS, but also on each tier of the
applications, such as web servers, application servers, and database servers. In this chapter, I will focus mainly on the
availability of the database server, which is the database administrator's responsibility.
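To see why every tier matters, consider a simplified model in which the application is up only when all of its
tiers are up, and the tiers fail independently of each other. Under those assumptions (which are mine, and which
real systems only approximate), the end-to-end availability is the product of the tier availabilities. The tier
names and figures below are hypothetical.

```python
from math import prod


def end_to_end_availability(tier_availabilities):
    """Availability of a chain of tiers that must all be up at the same time.

    Assumes the tiers fail independently; each value is a fraction between 0 and 1.
    """
    return prod(tier_availabilities)


# Hypothetical per-tier availabilities.
tiers = {"web": 0.999, "application": 0.999, "database": 0.9999}
overall = end_to_end_availability(tiers.values())
print(f"end-to-end availability: {overall:.4%}")
```

The result is lower than the availability of any single tier, so a highly available database is necessary but not
sufficient for a highly available application.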
Database availability also plays a critical role in application availability. We use downtime to refer to the periods
when a database is unavailable. Downtime can be either unplanned or planned. Unplanned downtime occurs
without warning, before system administrators or DBAs can prepare for it; it may be caused by an unexpected event
such as hardware or software failure, human error, or even a natural disaster (losing a data center). Although its
timing cannot be predicted, most unplanned downtime can be anticipated at design time; for example, when
designing a cluster it is best to assume that everything will fail, considering that most of these clusters are
commodity clusters and hence have parts that break. The key when designing for availability is to ensure that the
system has sufficient redundancy built into it, assuming that every component (including the entire site) may fail.
Planned downtime is usually associated with scheduled maintenance activities such as a system upgrade or migration.
Unplanned downtime of the Oracle database service can be due to data loss or server failure. The data loss may
be caused by storage medium failure, data corruption, deletion of data by human error, or even data center failure.
Data loss can be a very serious failure as it may turn out to be permanent, or could take a long time to recover from.
The solutions to data loss consist of prevention methods and recovery methods. Prevention methods include disk
mirroring by RAID (Redundant Array of Independent Disks) configurations such as RAID 1 (mirroring only) and
RAID 10 (mirroring and striping) in the storage array or with ASM (Automatic Storage Management) diskgroup
redundancy settings. Chapter 5 will discuss the details of the RAID configurations and ASM configurations for Oracle
Databases. Recovery methods focus on getting the data back through database recovery from a previous database
backup, through flashback recovery, or by switching to the standby database through a Data Guard failover.
Server failure is usually caused by hardware or software failure. Hardware failure can be a physical machine
component failure or a network or storage connection failure; software failure can be an OS crash or the failure of
an Oracle database instance or ASM instance. Usually during server failure, data in the database remains intact.
After the software or hardware issue is fixed, the database service on the failed server can be resumed after completing
database instance recovery and startup. Database service downtime due to server failure can be prevented by
providing redundant database servers so that the database service can fail over in case of primary server failure.
Network and storage connection failure can be prevented by providing redundant network and storage connections.
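The benefit of redundancy can be estimated with a simple model. The sketch below assumes that the redundant
servers fail independently and that failover is instantaneous and always succeeds; neither assumption holds
exactly in practice, so treat the result as an upper bound. The function name and figures are hypothetical.

```python
def redundant_availability(server_availability, server_count):
    """Availability of a service that is up while at least one server is up.

    Assumes independent server failures and perfect, instantaneous failover.
    """
    if server_count < 1:
        raise ValueError("at least one server is required")
    # The service is down only when every server is down at the same time.
    return 1 - (1 - server_availability) ** server_count


for count in (1, 2, 3):
    print(f"{count} server(s): {redundant_availability(0.99, count):.6f}")
```

Under these assumptions, adding a second server to one that is available 99% of the time brings the service to
about 99.99%, which is why redundant database servers are the usual answer to server failure.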