Media failures can result from a bad disk, a controller failure, a mirrored disk failure, block corruption,
or a power surge. Depending on the type of failure, a data file, a tablespace, or the entire database could be affected.
The extent of the damage determines how long the affected media will be offline and access will be interrupted.
Database operation after a media failure of the online redo log files or control files depends on whether those
files have been multiplexed. Storing the multiplexed copies on separate diskgroups protects them from a single failure.
For example, if a media failure damages one diskgroup holding a member of a multiplexed online redo log group, database
operation continues from the member on the other diskgroup without significant interruption. On the other hand,
if the files were not multiplexed, damage to the single copy of the redo log file could halt database operation and
may result in permanent loss of data.
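Multiplexing is configured with standard SQL. The following is a minimal sketch, assuming ASM diskgroups named +DATA and +FRA (hypothetical names) and existing redo log groups 1 and 2 that currently have a single member each on +DATA:

-- Add a second member on a separate diskgroup to each redo log group.
ALTER DATABASE ADD LOGFILE MEMBER '+FRA' TO GROUP 1;
ALTER DATABASE ADD LOGFILE MEMBER '+FRA' TO GROUP 2;

-- Verify that each group now has members on both diskgroups.
SELECT group#, member FROM v$logfile ORDER BY group#;

Control files are multiplexed in the same spirit, by listing copies on separate diskgroups in the CONTROL_FILES initialization parameter.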
All other types of media failure interrupt the business if appropriate system protection is not in place. Oracle
technology and Maximum Availability Architecture solutions help preserve business continuity during such media
failures.
Protecting the Database
Maximum Availability Architecture (MAA) solutions from Oracle include RAC and Oracle Data Guard. With these
technologies, redo data is shipped to a near-identical hardware configuration at a remote location and applied in
near real time. When a failure (such as a media failure) interrupts the business, database access can be switched
over from the primary to the standby location, providing continued availability. Oracle Database 10g Release 2
introduced a feature called fast-start failover, which automatically fails the primary role over to the standby and
allows the original primary to be reinstated as a standby, making failback operations seamless.
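Fast-start failover is enabled through the Data Guard broker. The following DGMGRL session is a minimal sketch, assuming a broker configuration already exists for the primary and standby databases; "primary" is a hypothetical connect alias, and the 30-second threshold is an illustrative value:

DGMGRL> CONNECT sys@primary
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
DGMGRL> ENABLE FAST_START FAILOVER;
DGMGRL> SHOW FAST_START FAILOVER;

A separate observer session (started with START OBSERVER in DGMGRL) monitors both databases and initiates the failover when the primary is unreachable for longer than the threshold.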
Recovery from a media failure also depends on the type of failure. Accordingly, data file recovery, tablespace
recovery, or full database recovery is performed on the primary, returning it to a usable state.
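As a minimal sketch of the data file case, the following RMAN session restores and recovers a single damaged data file while the rest of the database stays open; file number 4 is purely illustrative:

RMAN> CONNECT TARGET /
RMAN> SQL 'ALTER DATABASE DATAFILE 4 OFFLINE';
RMAN> RESTORE DATAFILE 4;
RMAN> RECOVER DATAFILE 4;
RMAN> SQL 'ALTER DATABASE DATAFILE 4 ONLINE';

Tablespace and full database recovery follow the same RESTORE and RECOVER pattern at a wider scope, with full database recovery requiring the database to be mounted rather than open.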
Testing Hardware for Availability
As we have seen previously, a RAC configuration has several components. Some of them, such as the interconnect,
nodes, and storage, are protected from failure by adding redundant infrastructure; others, such as the instance and
the database, are protected by database features such as policy-managed databases and Data Guard.
Irrespective of the type of component and the type of failure that could occur in the database configuration, it's
important that all components are configured correctly and validated before being implemented in production. To
accomplish this, all components should be tested for availability. In Chapter 1, we briefly discussed the RAP
methodology, which involves seven phases (RAP) of testing. Among the seven phases, RAP Phase I, RAP Phase III,
RAP Phase VI, and RAP Phase VII focus on availability testing.
RAP Phase I
During this phase of testing, the various failure points must be exercised to ensure that the RAC database
continues to function, either as a single instance or as a cluster, depending on where the failure occurred. For
example, when a node fails, the remaining nodes in the cluster should continue to function. Similarly, when a
network switch to the storage array fails, the redundant switch should keep traffic flowing. Tests should be
performed under load; that is, failures should be simulated while user activity is in progress, just as they could
happen in a live production environment.
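The following shell fragment is a minimal sketch of one such destructive test, assuming a two-node Linux cluster with hypothetical hostnames rac1 and rac2 and a workload already running against the database service:

# Hard-fail node rac1 without a clean shutdown (requires root); this mimics
# a sudden node crash rather than an orderly reboot.
ssh root@rac1 'echo b > /proc/sysrq-trigger'

# On the surviving node, confirm that Clusterware has reconfigured and the
# database service is still being offered.
ssh oracle@rac2 'crsctl stat res -t'

After verifying that sessions failed over and the workload continued, allow rac1 to rejoin the cluster and repeat the test against the other node.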
 