Tuning Recovery - Expert Oracle RAC Performance Diagnostics and Tuning

Instance Recovery

Instance recovery is the process of recovering the database when an instance crashes in the middle of user activity. Unlike in a traditional single-instance database, recovery of an instance in a RAC environment is dynamic and happens while the database is up and active. It is probably the most important aspect of recovery that applies to RAC. The idea of having multiple nodes in a clustered configuration is to provide availability, on the assumption that if one or more instances in the cluster were to fail, the remaining instances would provide business continuity. For this reason, instance recovery becomes more critical.

One of the primary requirements of a RAC configuration is to have the redo logs of all instances participating in the cluster on shared storage. The primary reason for this requirement is to make the redo logs of any instance visible to all other instances in the cluster. This allows any instance in the cluster to perform instance recovery when another instance fails.

Instance failure can happen in several ways; the most common cause is failure of the node itself. A node can fail for several reasons, including a power surge, operator error, and so forth. An instance can also fail because a background process fails or dies, or because the instance encounters a kernel-level exception that raises an ORA-00600 or ORA-07445 error. Issuing a SHUTDOWN ABORT command can also cause an instance failure.

Instance failures could be of different kinds:

• The instance is totally down and the users do not have any access to the instance.
• The instance is up; however, when connecting to it, there is a hang situation or the user gets no response.

When an instance is not available, users can continue to access the database through one of the surviving instances in an active-active configuration, provided the failover option has been enabled in the application.

Recovery from an instance failure is performed by another instance that is up and running as part of the cluster configuration: the one whose heartbeat mechanism detected the failure first and informed the LMON process on its node. The LMON process on each cluster node communicates with the cluster manager (CM) on that node and exposes the information to the instance running there.

LMON provides the monitoring function by continually sending messages from the node on which it runs and often by writing to the shared disk. When a node fails to perform these functions, the other nodes no longer consider it a member of the cluster. Such a failure causes a change in the node's membership status within the cluster.

The LMON process controls the recovery of the failed instance by taking over its redo log files and performing instance recovery.
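
The heartbeat-based membership check described above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not how LMON or the cluster manager is implemented: the class name `HeartbeatMonitor`, the `timeout` value, and the node names are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy membership tracker: a node is dropped once its heartbeats stop."""
    timeout: float                          # hypothetical seconds allowed between heartbeats
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the last time the node was heard from (message or shared-disk write).
        self.last_seen[node] = now

    def members(self, now: float) -> set:
        # Nodes heard from within the timeout are still considered cluster members.
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}

    def failed(self, now: float) -> set:
        # Nodes that went silent; a surviving instance would recover these.
        return set(self.last_seen) - self.members(now)


monitor = HeartbeatMonitor(timeout=3.0)
for node in ("node1", "node2", "node3"):
    monitor.heartbeat(node, now=0.0)

# node1 and node2 keep reporting; node3 goes silent.
monitor.heartbeat("node1", now=4.0)
monitor.heartbeat("node2", now=4.0)

print(sorted(monitor.members(now=5.0)))   # ['node1', 'node2']
print(sorted(monitor.failed(now=5.0)))    # ['node3']
```

In this model the change in membership status is simply the node moving from `members` to `failed`; in a real cluster that change is what triggers reconfiguration and recovery.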

How Does Oracle Know That Recovery Is Required for a Given Data File?

The system change number (SCN) is a logical clock inside the database kernel that increments with every change made to the database. The SCN describes a “version”, that is, a committed version, of the database. When a database performs a checkpoint operation, an SCN (called the checkpoint SCN) is written to the data file headers. This is called the start SCN. There is also an SCN value in the control file for every data file, which is called the stop SCN. Another data structure, the checkpoint counter, exists in each data file header and also in the control file entry for each data file. The checkpoint counter increments every time a checkpoint happens on a data file and the start SCN value is updated. When a data file is in hot backup mode, the checkpoint information in the file header is frozen, but the checkpoint counter still gets updated.
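
The relationship between the start SCN and the checkpoint counter can be sketched in Python as a toy model. It does not reflect Oracle's actual on-disk layout; the class `DataFile` and its field names are hypothetical and chosen only to mirror the terms used in the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Toy model of the checkpoint fields described above (not Oracle's real layout)."""
    header_start_scn: int = 0       # checkpoint (start) SCN in the data file header
    header_ckpt_count: int = 0      # checkpoint counter in the data file header
    control_ckpt_count: int = 0     # matching counter in the control file entry
    in_hot_backup: bool = False

    def checkpoint(self, scn: int) -> None:
        # The counter advances on every checkpoint, in both places.
        self.header_ckpt_count += 1
        self.control_ckpt_count += 1
        # The checkpoint SCN in the header is frozen during hot backup mode.
        if not self.in_hot_backup:
            self.header_start_scn = scn


f = DataFile()
f.checkpoint(scn=100)
f.in_hot_backup = True              # file placed in hot backup mode
f.checkpoint(scn=200)

print(f.header_start_scn, f.header_ckpt_count)   # 100 2
```

The second checkpoint leaves the start SCN at 100 while the counter moves to 2, which is the frozen-header behavior described for hot backup mode.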

When the database is shut down gracefully with the SHUTDOWN NORMAL or SHUTDOWN IMMEDIATE command, Oracle performs a checkpoint and copies the start SCN value of each data file to its corresponding stop SCN value in the control file before the database actually shuts down.
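
A minimal Python sketch of this rule follows. It rests on an assumption that this passage does not state: while the database is open, the stop SCN is treated as unset (modeled here as `None`), so a file whose stop SCN is missing or differs from its start SCN is taken to need recovery. The names `DataFileState`, `clean_shutdown`, and `needs_recovery` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataFileState:
    start_scn: int                    # checkpoint SCN in the data file header
    stop_scn: Optional[int] = None    # stop SCN in the control file; None while open (assumption)


def clean_shutdown(files, checkpoint_scn):
    # SHUTDOWN NORMAL / IMMEDIATE: checkpoint, then copy start SCN to stop SCN.
    for f in files:
        f.start_scn = checkpoint_scn
        f.stop_scn = f.start_scn


def needs_recovery(f):
    # Assumption: an unset or mismatched stop SCN means the file was not closed cleanly.
    return f.stop_scn is None or f.stop_scn != f.start_scn


clean = [DataFileState(start_scn=100), DataFileState(start_scn=100)]
clean_shutdown(clean, checkpoint_scn=250)

crashed = [DataFileState(start_scn=100)]   # instance failed: no final checkpoint

print([needs_recovery(f) for f in clean])     # [False, False]
print([needs_recovery(f) for f in crashed])   # [True]
```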
