to communicate with each other. However, oradb1 and oradb2 are not able to communicate with oradb3 and oradb4 .
That is, one set of nodes in the cluster is unable to communicate with the other set of nodes in the cluster. The cluster
splits into separate groupings (cohorts) of nodes, a condition that could potentially cause data corruption and must be resolved.
Oracle Clusterware handles the split-brain scenario by identifying the largest cohort and aborting all the nodes
that do not belong to it; in other words, all the nodes in the smaller cohort are terminated. If the cohorts are the
same size, the cohort containing the lowest-numbered node survives. In a split-brain node eviction, the following
message is present in the OCSSD log ( $GRID_HOME/log/ssky3l12p2/cssd/ocssd.log ) of the evicted node:
2014-01-15 22:34:08.960: [CSSD][1117178176] clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
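A quick way to confirm that an eviction was split-brain related is to search the OCSSD log on the evicted node for this message. A minimal check, using the same log path shown above (the ssky3l12p2 directory reflects the node's hostname and will differ on your cluster):
grep "Aborting local node to avoid splitbrain" $GRID_HOME/log/ssky3l12p2/cssd/ocssd.log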
Node Reboots
Evictions are caused by system faults, such as a node being unreachable by the other nodes in the cluster, whereas
reboots occur due to a lack of resources, for example, high CPU utilization. There are several reasons for a node reboot.
Node reboots due to losing access to the majority of voting disks (loss of quorum). To form
a quorum during conflict resolution, an odd number of voting disks helps resolve
decision-making scenarios by allowing the clusterware to reach a majority vote. It is for this reason that, as a
best practice, voting disks should be configured in odd numbers, typically three or five, depending on the number
of nodes participating in the cluster.
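The number and availability of the configured voting disks can be verified with the crsctl utility (run from the Grid Infrastructure home; the exact output format varies between releases):
$GRID_HOME/bin/crsctl query css votedisk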
Node reboots due to a node hang or a perceived/false node hang. This situation can arise
when the network or disk I/O channels are so busy that the heartbeats are not able to complete
in the required time. When this happens, the misscount and timeout thresholds are exceeded
even though the node is healthy, causing the node to reboot. Busy interconnects or networks are caused
by high-latency, low-bandwidth networks or by inefficient SQL statements. SQL
optimization, instance affinity, and service affinity can help reduce some of this busy
interconnect/network traffic.
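The heartbeat thresholds that govern these decisions can be checked with crsctl; the values are reported in seconds, and the defaults differ between platforms and releases:
$GRID_HOME/bin/crsctl get css misscount
$GRID_HOME/bin/crsctl get css disktimeout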
Node reboots due to a Global Cache/Enqueue Service Heartbeat Monitor, also called Lock
Manager Heartbeat ( LMHB ), group member kill request. LMHB is a RAC database background
process. Apart from forcing a node eviction during database hangs, the function of LMHB is to
monitor the heartbeats of the LMON , LMD , and LMSn processes. Like other background processes, the
activities of LMHB are recorded in the trace directory of the RDBMS instance.
The kill request from LMHB occurs when any of the critical background processes ( LMON , LMD , LMSn )
is hung or stuck during operation. Searching the LMHB trace file for
StatCheckCPU can capture this:
grep StatCheckCPU SSKYDB_1_lmhb_7768.trc
kjgcr_StatCheckCPU: cpu based load is high, currently 56, average 18
kjgcr_StatCheckCPU: cpu based load is high, currently 53, average 18
kjgcr_StatCheckCPU: cpu based load is high, currently 48, average 18
kjgcr_StatCheckCPU: cpu based load is high, currently 56, average 18
kjgcr_StatCheckCPU: cpu based load is high, currently 51, average 18
kjgcr_StatCheckCPU: cpu based load is high, currently 65, average 16
kjgcr_StatCheckCPU: cpu based load is high, currently 65, average 16
kjgcr_StatCheckCPU: runq based load is high, currently 212, average 44
kjgcr_StatCheckCPU: runq based load is high, currently 276, average 44
kjgcr_StatCheckCPU: runq based load is high, currently 253, average 44
kjgcr_StatCheckCPU: runq based load is high, currently 232, average 44
kjgcr_StatCheckCPU: runq based load is high, currently 317, average 44
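The trace file name embeds the instance name and the operating-system process ID of LMHB , so it differs on every system. One way to locate it, assuming the default Automatic Diagnostic Repository layout under $ORACLE_BASE , is:
ls $ORACLE_BASE/diag/rdbms/*/*/trace/*_lmhb_*.trc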
 