to communicate with each other. However, oradb1 and oradb2 are not able to communicate with oradb3 and oradb4 .
That is, one set of nodes in the cluster is unable to communicate with the other set of nodes in the cluster. The cluster
splits into separate groupings (cohorts) of nodes, a condition that could potentially cause data corruption and must be resolved.
Oracle Clusterware handles the split-brain scenario by identifying the largest cohort and aborting all the nodes
that do not belong to it; in other words, all the nodes in the smaller cohort are terminated. If the cohorts are the
same size, the cohort containing the lowest-numbered node survives. In a split-brain node eviction, the following
message is present in the OCSSD log ( $GRID_HOME/log/ssky3l12p2/cssd/ocssd.log ) of the evicted node:
2014-01-15 22:34:08.960: [CSSD][1117178176] clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
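A quick way to confirm that an eviction was split-brain related is to search the OCSSD log on the evicted node for this message. A minimal check, using the same log path shown above (the ssky3l12p2 directory reflects the node's hostname and will differ on your cluster):
grep "Aborting local node to avoid splitbrain" $GRID_HOME/log/ssky3l12p2/cssd/ocssd.log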
Node Reboots
Evictions are caused by system faults, such as a node being unreachable by the other nodes in the cluster, whereas
reboots occur due to a lack of resources, for example, high CPU utilization. There are several reasons for a node reboot.
Node reboots due to losing access to the majority of voting disks (loss of quorum). To form
a quorum during conflict resolution, an odd number of voting disks helps resolve
decision-making scenarios by allowing the clusterware to reach a majority vote. It is for this reason that, as a
best practice, voting disks should be configured in odd numbers, typically three or five, depending on the number
of nodes participating in the cluster.
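The number and availability of the configured voting disks can be verified with the crsctl utility (run from the Grid Infrastructure home; the exact output format varies between releases):
$GRID_HOME/bin/crsctl query css votedisk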
Node reboots due to a node hang or a perceived/false node hang. This situation can arise
when the network or disk I/O channels are so busy that the heartbeats are not able to complete
in the required time. When this happens, the misscount and timeout thresholds are exceeded
even though the node is healthy, causing the node to reboot. Busy interconnects or networks are caused
by high-latency, low-bandwidth networks or by inefficient SQL statements. SQL
optimization, instance affinity, and service affinity can help reduce some of this busy
interconnect/network traffic.
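The heartbeat thresholds that govern these decisions can be checked with crsctl; the values are reported in seconds, and the defaults differ between platforms and releases:
$GRID_HOME/bin/crsctl get css misscount
$GRID_HOME/bin/crsctl get css disktimeout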
Node reboots due to a Global Cache/Enqueue Service Heartbeat Monitor, also called Lock
Manager Heartbeat ( LMHB ), group member kill request. LMHB is a RAC database background
process. Apart from forcing a node eviction during database hangs, the function of LMHB is to
monitor the heartbeats of the LMON , LMD , and LMSn processes. Like other background processes, the
activities of LMHB are recorded in the trace directory of the RDBMS instance.
The kill request from LMHB occurs when any of the critical background processes ( LMON , LMD , LMSn )
is hung or stuck during operation. Searching the LMHB trace file for
StatCheckCPU can capture this:
grep StatCheckCPU SSKYDB_1_lmhb_7768.trc
kjgcr_StatCheckCPU: cpu based load is high, currently 56, average 18
kjgcr_StatCheckCPU: cpu based load is high, currently 53, average 18
kjgcr_StatCheckCPU: cpu based load is high, currently 48, average 18
kjgcr_StatCheckCPU: cpu based load is high, currently 56, average 18
kjgcr_StatCheckCPU: cpu based load is high, currently 51, average 18
kjgcr_StatCheckCPU: cpu based load is high, currently 65, average 16
kjgcr_StatCheckCPU: cpu based load is high, currently 65, average 16
kjgcr_StatCheckCPU: runq based load is high, currently 212, average 44
kjgcr_StatCheckCPU: runq based load is high, currently 276, average 44
kjgcr_StatCheckCPU: runq based load is high, currently 253, average 44
kjgcr_StatCheckCPU: runq based load is high, currently 232, average 44
kjgcr_StatCheckCPU: runq based load is high, currently 317, average 44
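The trace file name embeds the instance name and the operating-system process ID of LMHB , so it differs on every system. One way to locate it, assuming the default Automatic Diagnostic Repository layout under $ORACLE_BASE , is:
ls $ORACLE_BASE/diag/rdbms/*/*/trace/*_lmhb_*.trc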
 