Testing for Availability - Expert Oracle RAC Performance Diagnostics and Tuning

[Figure: a six-node cluster (Node 1 to Node 6) running instances SSKY_1 to SSKY_6, with storage labeled AV7 to AV18 and the disk groups GRID_DATA, PRD_DATA, and PRD_FRA. The points of failure are numbered 1 to 6.]
Figure 3-1. Points of failure in a RAC hardware configuration

We briefly examine each of these failure scenarios and discuss the various methods of protecting the system against them.

Interconnect Failure

If the interconnect between nodes fails, either because of a physical failure or a software failure in the communication or interprocess communication (IPC) layer, it appears to the Oracle Clusterware (OCW) at each end of the interconnect that the node at the other end has failed. The OCW should use an alternative method, such as checking for a quorum disk, to evaluate the status of the system. In the case of a complete communication link failure, a voting disk protocol is initiated. Whichever node grabs the largest number of disks becomes the master. Because the communication link is down, the master writes a kill block to the disk, and the instance that finds the kill block then kills itself.

Eventually, the OCW may shut down both of the nodes involved in the operation or just one of the nodes at the end of the failed connection. It evicts the node by means of fencing to prevent any continued writes that could potentially corrupt the database.
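The voting disk arbitration described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not how Oracle Clusterware is implemented: the names `VotingDisk`, `arbitrate`, and `must_terminate` are hypothetical, the race to reach each disk is modeled as an ordered list of claim attempts, and an odd number of disks is assumed so that two nodes cannot tie.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VotingDisk:
    """Hypothetical model of a shared disk that one node can claim."""
    name: str
    owner: Optional[str] = None
    kill_blocks: set = field(default_factory=set)


def arbitrate(disks, claim_order):
    """Resolve a split cluster; return (master, evicted_nodes).

    claim_order is a list of (node, disk_name) attempts. It stands in for
    the race to reach each disk once the interconnect is down.
    """
    by_name = {d.name: d for d in disks}
    counts = {}
    for node, disk_name in claim_order:
        counts.setdefault(node, 0)
        disk = by_name[disk_name]
        if disk.owner is None:  # the first node to reach a disk claims it
            disk.owner = node
            counts[node] += 1

    # The node holding the largest number of disks becomes the master.
    master = max(counts, key=counts.get)
    evicted = sorted(n for n in counts if n != master)

    # The network is down, so the master signals through the shared disks:
    # it writes a kill block for every node that lost the arbitration.
    for disk in disks:
        if disk.owner == master:
            disk.kill_blocks.update(evicted)
    return master, evicted


def must_terminate(node, disks):
    """A node that finds a kill block addressed to it shuts itself down."""
    return any(node in d.kill_blocks for d in disks)


disks = [VotingDisk("vd1"), VotingDisk("vd2"), VotingDisk("vd3")]
attempts = [
    ("node1", "vd1"), ("node2", "vd1"),
    ("node2", "vd2"), ("node1", "vd2"),
    ("node1", "vd3"), ("node2", "vd3"),
]
master, evicted = arbitrate(disks, attempts)
```

In this run `node1` claims two of the three disks and becomes the master, and `node2` finds a kill block and must terminate. Signaling through the shared disks rather than the network is the point of the design: the disks are the only channel both sides can still reach.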

Traditionally in a RAC environment, when a node or instance fails, an instance is elected to perform instance recovery. The Global Enqueue Service and the Global Cache Service are reconfigured after the failure; redo logs are merged and rolled forward. The transactions that have not been committed are rolled back.

This operation is performed by one of the surviving instances, which reads through the redo log files of the failed instance. Such recovery gives users immediate access to consistent data. However, while the clusterware is deciding which node(s) to shut down, access is denied and the recovery operation is delayed, so data is not available. This is because recovery operations are not performed until the interconnect failure causes one of the instances or nodes to fail.
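The roll-forward and rollback phases of instance recovery can be sketched as a toy model. The code below is an illustration under simplifying assumptions, not Oracle's recovery code: the `Redo` record and the `recover` function are hypothetical, each redo thread is assumed to be sorted by SCN, and every change record carries the old value as a stand-in for the undo data that the database actually uses to roll back.

```python
import heapq
from typing import NamedTuple, Optional


class Redo(NamedTuple):
    """Hypothetical redo record; 'old' stands in for undo information."""
    scn: int
    txn: str
    op: str                      # "update" or "commit"
    key: Optional[str] = None
    old: Optional[int] = None
    new: Optional[int] = None


def recover(data, redo_threads):
    """Return the recovered state without modifying the input.

    redo_threads is a list of redo streams, each already ordered by SCN.
    """
    # Merge the redo streams into a single SCN-ordered sequence.
    merged = list(heapq.merge(*redo_threads, key=lambda r: r.scn))
    committed = {r.txn for r in merged if r.op == "commit"}

    result = dict(data)
    # Roll forward: reapply every logged change, committed or not.
    for rec in merged:
        if rec.op == "update":
            result[rec.key] = rec.new

    # Roll back: undo the changes of uncommitted transactions, newest first.
    for rec in reversed(merged):
        if rec.op == "update" and rec.txn not in committed:
            result[rec.key] = rec.old
    return result


before = {"a": 1, "b": 2}
thread_1 = [Redo(10, "T1", "update", "a", 1, 5), Redo(30, "T1", "commit")]
thread_2 = [Redo(20, "T2", "update", "b", 2, 9)]
after = recover(before, [thread_1, thread_2])
```

Here transaction `T1` committed, so its change to `a` survives, while `T2` never committed, so its change to `b` is reapplied during roll forward and then undone during rollback.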
