Media failures can result from a bad disk, a controller failure, a mirrored disk failure, block corruption,
or a power surge. Depending on the type of failure, a data file, a tablespace, or the entire database could be affected.
The extent of the damage determines how long the affected media will be offline and access will be interrupted.
Database operation after a media failure of the online redo log files or control files depends on whether those
files have been multiplexed. Storing the multiplexed copies on separate diskgroups protects them from a single failure.
For example, if a media failure damages one diskgroup holding a member of a multiplexed online redo log group, database
operation continues from the member on the other diskgroup without significant interruption. On the other hand,
if the files were not multiplexed, damage to the single copy of the redo log file could halt database operation and
may result in permanent loss of data.
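Multiplexing is configured with standard SQL. The following is a minimal sketch, assuming ASM diskgroups named +DATA and +FRA (hypothetical names) and existing redo log groups 1 and 2 that currently have a single member each on +DATA:

-- Add a second member on a separate diskgroup to each redo log group.
ALTER DATABASE ADD LOGFILE MEMBER '+FRA' TO GROUP 1;
ALTER DATABASE ADD LOGFILE MEMBER '+FRA' TO GROUP 2;

-- Verify that each group now has members on both diskgroups.
SELECT group#, member FROM v$logfile ORDER BY group#;

Control files are multiplexed in the same spirit, by listing copies on separate diskgroups in the CONTROL_FILES initialization parameter.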
All other types of media failure interrupt the business if appropriate system protection is not in place. Oracle
technology and Maximum Availability Architecture solutions help preserve business continuity during such media
failures.
Protecting the Database
Maximum Availability Architecture (MAA) solutions from Oracle include RAC and Oracle Data Guard. With these
technologies, redo data is shipped to a near-identical hardware configuration at a remote location and applied in
near real time. When a failure (such as a media failure) interrupts the business, database access can be switched
over from the primary to the standby location, providing continued availability. Oracle Database 10g Release 2
introduced a feature called fast-start failover, which automatically fails the primary role over to the standby and
allows the original primary to be reinstated as a standby, making failback operations seamless.
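Fast-start failover is enabled through the Data Guard broker. The following DGMGRL session is a minimal sketch, assuming a broker configuration already exists for the primary and standby databases; "primary" is a hypothetical connect alias, and the 30-second threshold is an illustrative value:

DGMGRL> CONNECT sys@primary
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
DGMGRL> ENABLE FAST_START FAILOVER;
DGMGRL> SHOW FAST_START FAILOVER;

A separate observer session (started with START OBSERVER in DGMGRL) monitors both databases and initiates the failover when the primary is unreachable for longer than the threshold.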
Recovery from a media failure also depends on the type of failure. Accordingly, data file recovery, tablespace
recovery, or full database recovery is performed on the primary, returning it to a usable state.
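As a minimal sketch of the data file case, the following RMAN session restores and recovers a single damaged data file while the rest of the database stays open; file number 4 is purely illustrative:

RMAN> CONNECT TARGET /
RMAN> SQL 'ALTER DATABASE DATAFILE 4 OFFLINE';
RMAN> RESTORE DATAFILE 4;
RMAN> RECOVER DATAFILE 4;
RMAN> SQL 'ALTER DATABASE DATAFILE 4 ONLINE';

Tablespace and full database recovery follow the same RESTORE and RECOVER pattern at a wider scope, with full database recovery requiring the database to be mounted rather than open.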
Testing Hardware for Availability
As we have seen previously, a RAC configuration has several components. Some of them, such as the interconnect,
nodes, and storage, are protected from failure by adding redundant infrastructure; others, such as the instance and
the database, are protected by database features such as policy-managed databases and Data Guard.
Irrespective of the type of component and the type of failure that could occur in the database configuration, it's
important that all components are configured correctly and validated before being implemented in production. To
accomplish this, all components should be tested for availability. In Chapter 1, we briefly discussed the RAP
methodology, which involves seven phases (RAP) of testing. Among the seven phases, RAP Phase I, RAP Phase III,
RAP Phase VI, and RAP Phase VII focus on availability testing.
RAP Phase I
During this phase of testing, the various failure points must be exercised to ensure that the RAC database
continues to function, either as a single instance or as a cluster, depending on where the failure occurred. For
example, when a node fails, the remaining nodes in the cluster should continue to function. Similarly, when a
network switch to the storage array fails, the redundant switch should keep traffic flowing. Tests should be
performed under load; that is, failures should be simulated while user activity is in progress, just as they could
happen in a live production environment.
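The following shell fragment is a minimal sketch of one such destructive test, assuming a two-node Linux cluster with hypothetical hostnames rac1 and rac2 and a workload already running against the database service:

# Hard-fail node rac1 without a clean shutdown (requires root); this mimics
# a sudden node crash rather than an orderly reboot.
ssh root@rac1 'echo b > /proc/sysrq-trigger'

# On the surviving node, confirm that Clusterware has reconfigured and the
# database service is still being offered.
ssh oracle@rac2 'crsctl stat res -t'

After verifying that sessions failed over and the workload continued, allow rac1 to rejoin the cluster and repeat the test against the other node.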
 