If the issue is an instance or server crash in the RAC environment, data related to specific modules has to be collected; for example, data on the interconnect, on heartbeat verification over the interconnect, and on heartbeat verification against the voting disks. A detailed look at the Grid Infrastructure log files may be needed after enabling debug (crsctl debug log css "CSSD:9") so that the clusterware writes more data into these log files. If this is a performance-related concern, collecting a trace from the user session is very helpful in analyzing the issue; tools such as the Lightweight Onboard Monitor (LTOM 1), or at a minimum a trace using event 10046, serve this purpose.
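As an illustration, a minimal sketch of enabling an extended SQL trace with event 10046 in the session under investigation could look like the following; level 12 captures both wait events and bind values, and the tracefile identifier is an arbitrary label used here only to make the resulting trace file easy to locate.

ALTER SESSION SET tracefile_identifier = 'perf_issue';
-- Level 12 = waits (level 8) plus binds (level 4)
ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';
-- ... reproduce the slow operation from the user session ...
ALTER SESSION SET EVENTS '10046 trace name context off';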
Often, instance or server crashes in a RAC environment are due to an overload that affects the overall performance of the system. In these situations, the direction of the analysis could shift to the availability or stability of the cluster; however, the root cause analysis may indicate other reasons.
Area Drilldown
Drilling down further to identify the cause or area of a performance issue is probably the most critical of the steps, because with all the data collected, it is time to narrow in on the actual reason that led to the problem. Irrespective of whether this is an instance/server crash because of overload or a poorly performing module or application, the actual problem should be identified and documented at this stage. For example, which query in the module or application is slowing down the process, or is there contention caused by another application (batch) that is causing the online application to slow down?
At this level of drilldown, the details of the application area need to be identified: which service, which module, and which action is the reason for the slowness. The DBMS_APPLICATION_INFO package discussed earlier is very helpful in obtaining this level of detail.
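As a sketch (the module and action names below are hypothetical), the application tags its work using DBMS_APPLICATION_INFO, and the tagged sessions can then be grouped by service, module, and action across all instances:

BEGIN
  -- Hypothetical names; the application supplies its own module and action
  DBMS_APPLICATION_INFO.SET_MODULE(module_name => 'ORDER_ENTRY',
                                   action_name => 'INSERT_ORDER');
END;
/

-- Which service, module, and action are active on which instance?
SELECT inst_id, service_name, module, action, COUNT(*) session_count
FROM   gv$session
WHERE  type = 'USER'
GROUP  BY inst_id, service_name, module, action;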
Problem Resolution
Working to resolve the performance issue is probably the most critical step. When resolving problems, database parameters may have to be changed; host bus adapter (HBA) controllers, networks, or additional infrastructure such as CPU or memory may have to be added; a poorly performing structured query language (SQL) query may have to be tuned; the batch application may have to be scheduled so that it does not run in the same time frame as the primary online application; or, better still, the workload may be distributed using database services so that resource contention on any one server/instance does not cause poor response times (see the sketch below). It is important that the entire application is taken into consideration when fixing problems; fixes that help one part of the application should not affect other parts.
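For example (the database, service, and instance names here are assumptions for illustration), batch and online workloads could be separated onto different preferred instances using clusterware-managed services:

# Batch work prefers instance PRODDB1 and can fail over to PRODDB2
srvctl add service -d PRODDB -s BATCH -r PRODDB1 -a PRODDB2
# Online work prefers instance PRODDB2 and can fail over to PRODDB1
srvctl add service -d PRODDB -s ONLINE -r PRODDB2 -a PRODDB1
srvctl start service -d PRODDB -s BATCH
srvctl start service -d PRODDB -s ONLINE

The batch application then connects through the BATCH service and the online application through the ONLINE service, so under normal conditions each runs on its own instance while keeping the other instance available for failover.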
Testing Against Baseline
Once the identified problem has been fixed and unit tested, the code is integrated with the rest of the application and tested to verify that the performance issue has been resolved. In the case of hardware-related changes or fixes, such a test may be very hard to verify; however, if the fix is applied over a weekend or during a maintenance window, the application can be tested to ensure it is not broken by these changes. The complexity of the situation and the maintenance window available will determine how extensive these tests can be. A great benefit of database services is that they allow a certain server or database instance to be withdrawn from regular usage, or limited access to be given to a certain part of the application functionality, which can then be tested against that instance or workload until the fix is verified and made available for others to use (see the sketch below).
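As a sketch (the database, service, and instance names are assumptions, and the test-only service is presumed to already exist), one instance could be withdrawn from regular use and exposed only through a test service until the fix is verified:

# Withdraw instance PRODDB1 from regular use by the online application
srvctl stop service -d PRODDB -s ONLINE -i PRODDB1
# Expose the fixed functionality on that instance through a test-only service
srvctl start service -d PRODDB -s TESTSVC -i PRODDB1
# Once verified, return the instance to regular use
srvctl start service -d PRODDB -s ONLINE -i PRODDB1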
1 Usage and implementation of LTOM will be discussed in Chapter 6.