Understanding, Debugging, and Preventing Node Evictions
Node Evictions—Synopsis and Overview
A node eviction is the mechanism/process (piece of code) designed within Oracle Clusterware to ensure cluster
consistency and maintain overall cluster health by removing any node that either suffers critical issues or doesn't
respond to other nodes' requests in the cluster in a timely manner. For example, when a node in the cluster is hung or
suffering critical problems, such as network or disk latency that prevents it from maintaining its heartbeat within the
internal timeout value, or when the cluster stack or clusterware on the node is unhealthy, the node will leave the cluster
and perform a fast self-reboot to ensure overall cluster health. When a node doesn't respond to another node's request
in a timely manner, the node receives a poison packet through disk or network with the instruction to leave the cluster
by killing itself. When the problematic node reads the poison (kill) packet, it evicts itself and leaves the cluster;
the evicted node then performs a fast reboot to restore a healthy environment among the nodes in the cluster. A fast
reboot generally doesn't wait for pending I/O to be flushed; instead, that I/O is fenced right after the node reboots.
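The heartbeat timeouts that drive these eviction decisions are CSS parameters that you can inspect from any cluster node. As a quick illustration (the defaults noted in the comments are typical for recent Linux/Unix releases but vary by version and platform):
 
$ crsctl get css misscount      # network heartbeat timeout in seconds (commonly 30)
$ crsctl get css disktimeout    # voting disk I/O (disk heartbeat) timeout in seconds (commonly 200)
$ crsctl get css reboottime     # time allowed for the evicted node to complete its reboot (commonly 3)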
Debugging Node Eviction
Node eviction is indeed one of the worst nightmares of RAC DBAs, posing new challenges on every
occurrence. Typically, when you maintain a complex or large-scale cluster setup with a large number of nodes,
occasional node evictions are inevitable and should be anticipated. Therefore, node eviction is one of the key
areas to which you have to pay close attention.
In general, a node eviction means unscheduled server downtime, and unscheduled downtime can cause
service disruption; frequent node evictions will also impact an organization's overall reputation. In this section, we
present the most common symptoms that lead to a node eviction, as well as the crucial cluster log
files you need to examine to analyze and debug the issue and determine the root cause of the eviction.
If you have been a cluster administrator for a while, you have most likely confronted a node
eviction at least once in your environment. The generic node eviction information logged in some of the
cluster logs doesn't always point directly to the actual root cause of the problem. Therefore,
you need to gather and refer to various types of log file information from the clusterware, the platform, OS Watcher
output, and so on, in order to find the root cause of the issue.
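As a starting point, the following locations are usually worth collecting. Treat this list as an illustrative sketch rather than a definitive inventory, since the exact paths differ between Grid Infrastructure releases (from 12.1.0.2 onward most clusterware logs live under the ADR base rather than the Grid home):
 
$GRID_HOME/log/<hostname>/alert<hostname>.log    # cluster alert log (pre-12.1.0.2 layout)
$GRID_HOME/log/<hostname>/cssd/ocssd.log         # CSS daemon log (pre-12.1.0.2 layout)
$ORACLE_BASE/diag/crs/<hostname>/crs/trace/      # clusterware alert and trace files (12.1.0.2 and later)
/var/log/messages (or platform equivalent)       # OS log entries around the eviction time
OS Watcher archives                              # CPU, memory, I/O, and network history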
A typical warning message, outlined hereunder, about a particular node being evicted is printed in
the cluster alert.log of a surviving node just a few seconds before the node eviction. Though you can't prevent the
eviction from happening in such a short time span, the warning message does bring to your attention which node is
about to leave the cluster:
[ohasd(6525)]CRS-8011:reboot advisory message from host: node04, component: cssmonit, with time
stamp: L-2013-03-17-05:24:38.904
[ohasd(6525)]CRS-8013:reboot advisory message text: Rebooting after limit 28348 exceeded; disk
timeout 28348, network timeout 27829,
last heartbeat from CSSD at epoch seconds 1363487050.476, 28427 milliseconds ago based on invariant
clock value of 4262199865
2013-03-17 05:24:53.928
[cssd(7335)]CRS-1612:Network communication with node node04 (04) missing for 50% of timeout
interval. Removal of this node from cluster in 14.859 seconds
2013-03-17 05:25:02.028
[cssd(7335)]CRS-1611:Network communication with node node04 (04) missing for 75% of timeout
interval. Removal of this node from cluster in 6.760 seconds
2013-03-17 05:25:06.068
[cssd(7335)]CRS-1610:Network communication with node node04 (04) missing for 90% of timeout
interval. Removal of this node from cluster in 2.720 seconds