Up to this point, Oracle Clusterware (OCW) has handled the failure detection. Because a node failure also involves
an instance failure, further steps are required before one of the surviving instances performs recovery. Instance
failure involves database crash recovery followed by instance recovery.
Instance Failure
RAC comprises several instances that share a common physical database. Because several instances are involved
in this configuration, one or more of them may fail. If all instances participating in the configuration fail, the
database is unusable; this is called a crash or database crash, and the recovery process associated with it is
called crash recovery.
If one or more instances fail but at least one survives, there is only an instance failure, and the recovery
process associated with it is called instance recovery.
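The distinction between the two recovery types can be summarized in a few lines of code. The following is a minimal sketch for illustration only; the names `RecoveryType` and `classify_failure` are hypothetical and do not correspond to any Oracle API.

```python
from enum import Enum


class RecoveryType(Enum):
    """Recovery categories as defined in the text (names are illustrative)."""
    NONE = "no recovery needed"
    INSTANCE = "instance recovery"
    CRASH = "crash recovery"


def classify_failure(total_instances: int, failed_instances: int) -> RecoveryType:
    """Return the recovery type implied by how many instances have failed."""
    if total_instances < 1:
        raise ValueError("a RAC database has at least one instance")
    if not 0 <= failed_instances <= total_instances:
        raise ValueError("failed_instances must be between 0 and total_instances")

    if failed_instances == 0:
        return RecoveryType.NONE
    if failed_instances == total_instances:
        # Every instance is down: the database itself has crashed.
        return RecoveryType.CRASH
    # At least one instance survives and can recover the failed ones.
    return RecoveryType.INSTANCE


if __name__ == "__main__":
    print(classify_failure(4, 1).value)
    print(classify_failure(4, 4).value)
```

For a four-instance cluster, losing one instance is classified as instance recovery, while losing all four is classified as crash recovery.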
An instance can fail in several ways. The most common cause is a node failure, whether from a power surge,
operator error, and so forth, or from the failure of a cluster component such as the public NIC or the HBA
device. An instance failure also occurs when an operator issues a SHUTDOWN ABORT.
Recovery from an instance failure begins when one of the surviving nodes (the one whose heartbeat mechanism
detected the failure first) informs the LMON process. The LMON process on each instance in the cluster
communicates with the OCW on its respective node and initiates instance recovery. The recovering instance
acquires the locks for the redo thread of the failed instance. The redo logs of the failed instance are then
read by the System Monitor (SMON) process or by recovery slaves during recovery.
Server Pools
Starting with Oracle Database 11g Release 2, Oracle introduced a new feature for node and instance management
called Policy Managed. This feature allows nodes to be grouped into server pools and prioritizes the
availability of nodes to the instances that belong to specific pools. As discussed in the previous sections, if
a node crashes and its pool no longer has the required number of instances, a server from another pool can be
provisioned automatically to the pool where the failure occurred, and the instances are started automatically.
Using the pool management feature, rules can be defined based on the criticality of the database and on
workload patterns.
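The reallocation behavior described above can be illustrated with a toy model. This sketch is not Oracle's actual placement algorithm. It assumes each pool has a minimum size and a numeric importance, and that a pool below its minimum takes a server from the least important pool that still has one. `ServerPool`, `rebalance_after_failure`, and the pool and node names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ServerPool:
    """Toy server pool: a name, a minimum size, and a relative importance."""
    name: str
    min_size: int
    importance: int
    servers: List[str] = field(default_factory=list)


def rebalance_after_failure(
    pools: List[ServerPool], failed_server: str
) -> List[Tuple[str, str, str]]:
    """Drop the failed server, then refill pools that fell below min_size.

    Returns a list of (server, from_pool, to_pool) moves.
    Assumption: donors are chosen by lowest importance, and a donor may be
    taken below its own minimum by a more important pool.
    """
    for pool in pools:
        if failed_server in pool.servers:
            pool.servers.remove(failed_server)

    moves = []
    # Satisfy the most important pools first.
    for needy in sorted(pools, key=lambda p: -p.importance):
        while len(needy.servers) < needy.min_size:
            donors = [
                p for p in pools
                if p is not needy and p.importance < needy.importance and p.servers
            ]
            if not donors:
                break  # nothing left to take; the pool stays short
            donor = min(donors, key=lambda p: p.importance)
            server = donor.servers.pop()
            needy.servers.append(server)
            moves.append((server, donor.name, needy.name))
    return moves


if __name__ == "__main__":
    oltp = ServerPool("oltp", min_size=2, importance=10, servers=["node1", "node2"])
    batch = ServerPool("batch", min_size=1, importance=1, servers=["node3", "node4"])
    print(rebalance_after_failure([oltp, batch], "node1"))
```

In this example the critical `oltp` pool loses `node1`, falls below its minimum of two servers, and receives `node4` from the less important `batch` pool.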
Oracle Component Failure
In Chapter 2, we discussed the various components of the RAC configuration. To keep the database up and running,
the clusterware on the server must be functioning. Several clusterware components are configured to restart
automatically on failure (Cluster Ready Services Daemon [CRSD], Event Manager Daemon [EVMD], enhanced Oracle
Notification Service [eONS], etc.); others are critical (Oracle High Availability Service Daemon [OHASD], Grid
Naming Service Daemon [GNSD]), and their failure causes the server or node to reboot automatically.
Other components, such as the OCR, the Oracle Local Registry (OLR), and the voting disks, support the
clusterware and provide the HA services required. Starting with Oracle Database 11g Release 2, the voting disks
and OCR files can be managed by ASM.
Protecting the OCR
There are two methods by which redundancy for the OCR files can be provided.
 