Testing for Availability - Expert Oracle RAC Performance Diagnostics and Tuning

Up to this point, the OCW did the failure detection. Because a node failure also involves an instance failure, further steps are required before one of the surviving instances performs recovery. Instance failure involves database crash recovery followed by instance recovery.

Instance Failure

RAC consists of several instances talking to a common shared physical database. Because several instances are involved in this configuration, one or more instance failures may occur. If all instances participating in the configuration fail, the database is in an unusable state; this is called a crash or database crash, and the recovery process associated with this failure is called crash recovery.

If one or more of these instances fail, but not all of them, there is only an instance failure, and the recovery process associated with this failure is called instance recovery.
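
The distinction between the two recovery types can be summarized in a few lines of Python. This is only an illustrative sketch: the function name `classify_failure` and the instance names are hypothetical and are not part of any Oracle tool or API.

```python
def classify_failure(all_instances, failed_instances):
    """Return the recovery type implied by a set of failed instances.

    Hypothetical helper for illustration only:
      - no failed instance           -> "none"
      - every instance has failed    -> "crash recovery"
      - some, but not all, failed    -> "instance recovery"
    """
    all_set = set(all_instances)
    # Ignore names that are not part of the configuration.
    failed = set(failed_instances) & all_set
    if not failed:
        return "none"
    if failed == all_set:
        return "crash recovery"
    return "instance recovery"
```

For a three-instance configuration, losing one or two instances maps to instance recovery, while losing all three maps to crash recovery.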

An instance can fail in several ways. The most common cause is a node failure, due to reasons such as a power surge, operator error, and so forth, or the failure of one of the cluster components, such as the public NIC or the HBA device. An instance failure also occurs when an operator issues a SHUTDOWN ABORT.

Recovery from an instance failure begins when one of the surviving nodes (whose heartbeat mechanism detected the failure first) informs the LMON process. The LMON process on each instance in the cluster communicates with the OCW on the respective nodes and initiates instance recovery. The recovering instance will acquire the locks for the redo thread of the other instance. The redo logs of the failed instance are read by the System Monitor (SMON) or recovery slaves during recovery.
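
The lock-then-read sequence described above can be pictured with a toy model. The sketch below is not how Oracle implements recovery: `RedoThread`, `recover_failed_instance`, and the dictionary standing in for the data files are all hypothetical, and only the roll-forward of redo records is modeled (rolling back uncommitted work is omitted).

```python
class RedoThread:
    """Toy redo thread: an ordered list of (block_id, new_value) records."""

    def __init__(self, thread_id, records):
        self.thread_id = thread_id
        self.records = list(records)
        self.locked_by = None  # name of the instance performing recovery

    def try_lock(self, instance_name):
        """Grant the lock to the first requester only."""
        if self.locked_by is None:
            self.locked_by = instance_name
            return True
        return False


def recover_failed_instance(survivors, failed_thread, datafile):
    """Let the first survivor that locks the failed thread replay its redo.

    `datafile` is a plain dict (block_id -> value) standing in for the
    shared database. Returns the name of the recovering instance, or
    None if another instance already holds the lock.
    """
    for name in survivors:
        if failed_thread.try_lock(name):
            # Apply the redo records in order (roll forward).
            for block_id, value in failed_thread.records:
                datafile[block_id] = value
            return name
    return None
```

Because the lock is granted once, a second survivor that attempts the same recovery gets `None` back and leaves the data untouched, which mirrors the idea that a single recovering instance owns the failed instance's redo thread.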

Server Pools

Starting with Oracle Database 11g Release 2, Oracle introduced a feature for node and instance management called Policy Managed. This feature allows nodes to be configured into pools and prioritizes the availability of nodes to the instances that are part of specific pools. As discussed in the previous sections, if a node crashes and the pool no longer has the number of instances it is required to have, a server from another pool can automatically be provisioned to the pool where the failure occurred, and the instances are started automatically. Using the pool management feature, rules can be defined based on the criticality of the database and on workload patterns.
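
To make the reallocation idea concrete, here is a simplified Python model of pools that each have a minimum size and an importance value. It is a sketch under stated assumptions, not Oracle's actual placement algorithm: the class `ServerPool`, the function `handle_node_failure`, and the donor-selection rule (take from pools with spare servers first, otherwise from less important pools) are illustrative choices.

```python
from dataclasses import dataclass, field


@dataclass
class ServerPool:
    """Hypothetical pool model: a name, a minimum size, and an importance."""
    name: str
    min_size: int
    importance: int
    servers: list = field(default_factory=list)


def handle_node_failure(pools, failed_server):
    """Drop the failed server, then refill pools that fell below min_size.

    Pools are refilled in order of decreasing importance. A donor is
    either a pool holding more servers than its own minimum, or, failing
    that, a less important pool that still has a server. Returns a list
    of (server, from_pool, to_pool) moves.
    """
    moves = []
    for pool in pools:
        if failed_server in pool.servers:
            pool.servers.remove(failed_server)

    for pool in sorted(pools, key=lambda p: -p.importance):
        while len(pool.servers) < pool.min_size:
            others = [d for d in pools if d is not pool and d.servers]
            # Prefer donors that can spare a server without dropping below min.
            donors = [d for d in others if len(d.servers) > d.min_size]
            if not donors:
                donors = [d for d in others if d.importance < pool.importance]
            if not donors:
                break  # nothing available; the pool stays short
            donor = min(donors, key=lambda d: d.importance)
            server = donor.servers.pop()
            pool.servers.append(server)
            moves.append((server, donor.name, pool.name))
    return moves
```

With an important pool that needs two servers and a less important pool holding a spare one, the failure of a server in the important pool results in one server being moved across, so the critical workload regains its minimum.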

Oracle Component Failure

In Chapter 2, we discussed the various components of the RAC configuration. To keep the database up and running, the clusterware on the server must be functioning. Several clusterware components (Cluster Ready Services Daemon [CRSD], Event Manager Daemon [EVMD], enhanced Oracle Notification Service [eONS], etc.) are configured to restart automatically on failure; others (Oracle High Availability Service Daemon [OHASD], Grid Naming Service Daemon [GNSD]) are critical, and their failure will cause the server or node to reboot automatically. Other components, such as the OCR, Oracle Local Registry (OLR), and voting disks, support the clusterware in staying functional and providing the required HA services. Starting with Oracle Database 11g Release 2, the voting disks and OCR files are managed by ASM.

Protecting the OCR

Redundancy for the OCR files can be provided by two methods.
