Physical Fault Models and Fault Tolerance - Models in Hardware Testing

coupled to a Clock Synchronization Unit (CSU) for maintaining a global time base, and a Time Slice Controller (TSC) for controlling access to the system bus. Whenever an error is detected, the node's subsequent error processing consists of saving the error information to non-volatile memory and then turning itself off. Upon restart, the node writes the previously saved error information to two serial ports (one for each unit), from which it can be read for diagnostic purposes. This feature was exploited in this study to precisely monitor and characterize the consequences of the injected faults.
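The save-then-shut-down behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Node` class, its fields, and the choice to route each record to the port of the unit that logged it (and to clear the memory afterwards) are hypothetical and are not taken from the MARS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of a node that saves error data and then turns itself off.

    All names are hypothetical; this is not the MARS implementation.
    """
    nvram: list = field(default_factory=list)  # stands in for non-volatile memory
    serial_ports: tuple = field(default_factory=lambda: ([], []))  # one per unit
    running: bool = True

    def on_error_detected(self, unit: int, info: str) -> None:
        # Error processing: save the error information, then switch off.
        self.nvram.append((unit, info))
        self.running = False

    def restart(self) -> None:
        # On restart, write the saved error information to the serial ports.
        # Assumption: each record goes to the port of the unit that logged it.
        for unit, info in self.nvram:
            self.serial_ports[unit].append(info)
        self.nvram.clear()  # assumption: records are cleared once written out
        self.running = True
```

In a fault injection experiment, an observer attached to the two port lists would see exactly one record per detected error after each restart, which is what makes the mechanism usable for characterizing the effects of injected faults.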

Three levels of error detection mechanisms (EDMs) are implemented in the

MARS nodes: (1) the hardware EDMs, (2) the system software EDMs implemented

in the operating system and support software (i.e., the Modula/R compiler), and

(3) the application-level (end-to-end) EDMs at the highest level. They are briefly

described in the following paragraphs.

Hardware EDMs. Whenever an error is detected by one of the hardware EDMs, an exception is usually raised and the two CPUs then wait for a reset issued by a watchdog timer. This timer is the only device that may cause a reset of all devices, including the CPUs. Two categories of hardware EDMs can be distinguished: the CPU built-in mechanisms and those provided by special hardware on the processing board. In addition, faults can also trigger “unexpected” exceptions (i.e., exceptions to which neither the EDMs built into the CPUs nor the mechanisms provided by special hardware are mapped).

The EDMs built into the CPUs are: bus error, address error, illegal op-code, privilege violation, zero-divide, stack format error, non-initialized vector interrupt and spurious interrupt. These errors cause the processor to jump to the appropriate exception handling routines, which save the error state to the non-volatile memory and then reset the node.

The following errors are detected by mechanisms implemented by special hardware on the node: silent shutdown of the CPU of the communication unit, power failure, parity error, FIFO over/underflow, access to physically non-existing memory, write access to the real-time network at an illegal point in time (monitored by the TSC), error of an external device and error of the other unit. We collectively call these “NMI mechanisms”, as they raise a Non-Maskable Interrupt (a specific exception number) when an error is detected. An NMI leads to the same error handling as the EDMs built into the CPUs and can only be cleared by resetting the node, which is carried out by the watchdog timer.
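The three outcomes of a hardware-detected error (CPU built-in EDM, NMI mechanism, or “unexpected” exception) can be sketched as a simple classification. The string labels, the `EdmClass` enum, and the `classify` function are hypothetical names chosen for illustration; the real node distinguishes these cases by exception number, which the text does not list.

```python
from enum import Enum


class EdmClass(Enum):
    CPU_BUILT_IN = "CPU built-in"
    NMI = "NMI mechanism"
    UNEXPECTED = "unexpected exception"


# Errors detected by the EDMs built into the CPUs (labels are illustrative).
CPU_BUILT_IN = {
    "bus error", "address error", "illegal op-code", "privilege violation",
    "zero-divide", "stack format error", "non-initialized vector interrupt",
    "spurious interrupt",
}

# Errors detected by special hardware on the node, all signalled through the NMI.
NMI_SOURCES = {
    "silent shutdown of communication CPU", "power failure", "parity error",
    "FIFO over/underflow", "access to non-existing memory",
    "illegal-time network write", "external device error", "other unit error",
}


def classify(event: str) -> EdmClass:
    """Map a detected event to one of the three hardware-level categories."""
    if event in CPU_BUILT_IN:
        return EdmClass.CPU_BUILT_IN
    if event in NMI_SOURCES:
        return EdmClass.NMI
    # Anything not mapped to a known mechanism counts as "unexpected".
    return EdmClass.UNEXPECTED
```

For example, `classify("parity error")` yields `EdmClass.NMI`, while an event that neither set contains falls through to `EdmClass.UNEXPECTED`.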

System Software EDMs. These mechanisms are implemented by the operating system or by special system tasks. They include:

- Assertions built into the operating system (OS), such as integrity checks on data or processing time overflow
- Mechanisms inserted by the compiler (i.e., Compiler Generated Run-Time Assertions, CGRTA) to implement concurrent checks, such as value range overflow of a variable and loop iteration bound overflow

When an error is detected by any of these mechanisms, a “trap” instruction is executed, leading to a node reset.
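The two compiler-inserted checks named above (value range overflow and loop iteration bound overflow) might look roughly as follows if written by hand. This is a sketch in Python rather than Modula/R, and `Trap`, `check_range`, and `bounded_loop` are hypothetical names; the exception merely stands in for the “trap” instruction that leads to a node reset.

```python
class Trap(Exception):
    """Stands in for the "trap" instruction that leads to a node reset."""


def check_range(value: int, low: int, high: int) -> int:
    """Value range check: pass the value through, or trap if out of range."""
    if not (low <= value <= high):
        raise Trap(f"value {value} outside [{low}, {high}]")
    return value


def bounded_loop(items, max_iterations: int):
    """Loop bound check: yield items, trapping once the bound is exceeded."""
    for count, item in enumerate(items, start=1):
        if count > max_iterations:
            raise Trap(f"loop exceeded {max_iterations} iterations")
        yield item
```

A compiler would typically wrap each assignment to a range-constrained variable in the equivalent of `check_range`, and each loop with a statically known bound in the equivalent of `bounded_loop`, so that a corrupted value or a runaway loop is caught close to where it occurs.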
