Understanding, Debugging, and Preventing Node Evictions
Node Evictions—Synopsis and Overview
A node eviction is the mechanism/process (piece of code) designed within Oracle Clusterware to ensure cluster
consistency and maintain overall cluster health by removing any node that either suffers critical issues or doesn't
respond to other nodes' requests in the cluster in a timely manner. For example, when a node in the cluster is hung or
suffering critical problems, such as network or disk latency that prevents it from maintaining its heartbeat within the
internal timeout value, or when the cluster stack or clusterware on the node is unhealthy, the node will leave the cluster
and perform a fast self-reboot to ensure overall cluster health. When a node doesn't respond to another node's request
in a timely manner, the node receives a poison packet through disk or network with the instruction to leave the cluster
by killing itself. When the problematic node reads the poison (kill) packet, it evicts itself and leaves the cluster;
the evicted node then performs a fast reboot to restore a healthy environment among the nodes in the cluster. A fast
reboot generally doesn't wait for pending I/O to be flushed; instead, that I/O is fenced right after the node reboots.
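The heartbeat timeouts that drive these eviction decisions are CSS parameters that you can inspect from any cluster node. As a quick illustration (the defaults noted in the comments are typical for recent Linux/Unix releases but vary by version and platform):
 
$ crsctl get css misscount      # network heartbeat timeout in seconds (commonly 30)
$ crsctl get css disktimeout    # voting disk I/O (disk heartbeat) timeout in seconds (commonly 200)
$ crsctl get css reboottime     # time allowed for the evicted node to complete its reboot (commonly 3)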
Debugging Node Eviction
Node eviction is indeed one of the worst nightmares of RAC DBAs, posing new challenges on every
occurrence. Typically, when you maintain a complex or large-scale cluster setup with a large number of nodes,
occasional node evictions are inevitable and should be anticipated. Therefore, node eviction is one of the key
areas to which you have to pay close attention.
In general, a node eviction means unscheduled server downtime, and unscheduled downtime can cause
service disruption; frequent node evictions will also impact an organization's overall reputation. In this section, we
present the most common symptoms that lead to a node eviction, as well as the crucial cluster log
files you need to examine to analyze and debug the issue and determine the root cause of the eviction.
If you have been a cluster administrator for a while, you have most likely confronted a node
eviction at least once in your environment. The generic node eviction information logged in some of the
cluster logs doesn't always point directly to the actual root cause of the problem. Therefore,
you need to gather and refer to various types of log file information from the clusterware, the platform, OS Watcher
output, and so on, in order to find the root cause of the issue.
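As a starting point, the following locations are usually worth collecting. Treat this list as an illustrative sketch rather than a definitive inventory, since the exact paths differ between Grid Infrastructure releases (from 12.1.0.2 onward most clusterware logs live under the ADR base rather than the Grid home):
 
$GRID_HOME/log/<hostname>/alert<hostname>.log    # cluster alert log (pre-12.1.0.2 layout)
$GRID_HOME/log/<hostname>/cssd/ocssd.log         # CSS daemon log (pre-12.1.0.2 layout)
$ORACLE_BASE/diag/crs/<hostname>/crs/trace/      # clusterware alert and trace files (12.1.0.2 and later)
/var/log/messages (or platform equivalent)       # OS log entries around the eviction time
OS Watcher archives                              # CPU, memory, I/O, and network history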
A typical warning message, outlined hereunder, about a particular node being evicted is printed in
the cluster alert.log of a surviving node just a few seconds before the node eviction. Though you can't prevent the
eviction from happening in such a short time span, the warning message does bring to your attention which node is
about to leave the cluster:
[ohasd(6525)]CRS-8011:reboot advisory message from host: node04, component: cssmonit, with time
stamp: L-2013-03-17-05:24:38.904
[ohasd(6525)]CRS-8013:reboot advisory message text: Rebooting after limit 28348 exceeded; disk
timeout 28348, network timeout 27829,
last heartbeat from CSSD at epoch seconds 1363487050.476, 28427 milliseconds ago based on invariant
clock value of 4262199865
2013-03-17 05:24:53.928
[cssd(7335)]CRS-1612:Network communication with node node04 (04) missing for 50% of timeout
interval. Removal of this node from cluster in 14.859 seconds
2013-03-17 05:25:02.028
[cssd(7335)]CRS-1611:Network communication with node node04 (04) missing for 75% of timeout
interval. Removal of this node from cluster in 6.760 seconds
2013-03-17 05:25:06.068
[cssd(7335)]CRS-1610:Network communication with node node04 (04) missing for 90% of timeout
interval. Removal of this node from cluster in 2.720 seconds