coupled to a Clock Synchronization Unit (CSU) for maintaining a global time base,
and a Time Slice Controller (TSC) for controlling the access to the system bus.
Whenever an error is detected, the node saves the error information to non-volatile
memory and then turns itself off. Upon restart, the node writes the previously saved
error information to two serial ports (one for each unit), from which it can be read
for diagnostic purposes. This feature was exploited in this study to precisely monitor
and characterize the consequences of the injected faults.
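The save-then-shutdown behavior and the report on restart can be illustrated with a small model. This is only a sketch under stated assumptions: the class `FailSilentNode`, the unit names `unit_a` and `unit_b`, the record layout, and the clearing of the saved records after they are reported are all hypothetical and are not taken from the MARS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FailSilentNode:
    """Toy model of a node that records an error, turns itself off,
    and reports the saved record on restart. All names are hypothetical."""

    # Stands in for the non-volatile memory of the node.
    nvm: list = field(default_factory=list)
    # One serial port per unit; the unit names are assumptions.
    serial_ports: dict = field(
        default_factory=lambda: {"unit_a": [], "unit_b": []}
    )
    running: bool = True

    def on_error(self, unit: str, description: str) -> None:
        # Save the error information first, then turn the node off.
        self.nvm.append({"unit": unit, "error": description})
        self.running = False

    def restart(self) -> None:
        # Write each saved record to the serial port of its unit.
        for record in self.nvm:
            self.serial_ports[record["unit"]].append(record["error"])
        # Assumption: saved records are cleared once they are reported.
        self.nvm.clear()
        self.running = True
```

In a fault injection experiment, an external monitor would read the two serial ports after each restart to learn which mechanism detected the injected fault.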
Three levels of error detection mechanisms (EDMs) are implemented in the
MARS nodes: (1) the hardware EDMs, (2) the system software EDMs implemented
in the operating system and support software (i.e., the Modula/R compiler), and
(3) the application-level (end-to-end) EDMs at the highest level. They are briefly
described in the following paragraphs.
Hardware EDMs. Whenever an error is detected by one of the hardware EDMs,
an exception is usually raised and the two CPUs then wait for a reset issued by a
watchdog timer. This timer is the only device that may cause a reset of all devices
including the CPUs. Two categories of hardware EDMs can be distinguished: the
CPU built-in mechanisms and those provided by special hardware on the processing
board. In addition, faults can also trigger “unexpected” exceptions (i.e., neither the
EDMs built into the CPUs nor the mechanisms provided by special hardware are
mapped to these exceptions).
The EDMs built into the CPUs are: bus error, address error, illegal op-code,
privilege violation, zero-divide, stack format error, non-initialized vector interrupt,
and spurious interrupt. These errors cause the processor to jump to the appropriate
exception handling routines, which save the error state to the non-volatile memory and
then reset the node.
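The handling of the CPU built-in exceptions can be sketched as a lookup followed by a save and a reset request. The exception names are those listed above; the function `handle_cpu_exception`, the record layout, and the string `"reset"` returned to the caller are hypothetical and serve only as illustration.

```python
# Names of the CPU built-in EDMs, as listed in the text.
CPU_BUILT_IN_EDMS = frozenset({
    "bus error",
    "address error",
    "illegal op-code",
    "privilege violation",
    "zero-divide",
    "stack format error",
    "non-initialized vector interrupt",
    "spurious interrupt",
})


def handle_cpu_exception(name, error_state, nvm):
    """Save the error state to `nvm` (a list standing in for non-volatile
    memory) and request a node reset. Hypothetical sketch only."""
    # Tag the record so that a later diagnosis can tell CPU built-in
    # mechanisms apart from every other exception source.
    category = "cpu_built_in" if name in CPU_BUILT_IN_EDMS else "other"
    nvm.append({
        "exception": name,
        "category": category,
        "state": dict(error_state),  # copy, so later changes do not alter the record
    })
    return "reset"
```

Recording the category together with the raw error state is a design choice of this sketch; it keeps the saved record self-describing when it is later read for diagnosis.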
The following errors are detected by mechanisms implemented by special hardware
on the node: silent shutdown of the CPU of the communication unit, power failure,
parity error, FIFO over/underflow, access to physically non-existing memory, write
access to the real-time network at an illegal point in time (monitored by the TSC),
error of an external device, and error of the other unit. We collectively call these
“NMI mechanisms”, as they raise a Non-Maskable Interrupt (a specific exception
number) when an error is detected. An NMI leads to the same error handling as the
EDMs built into the CPUs and can only be cleared by resetting the node, which is
carried out by the watchdog timer.
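As an illustration of one NMI mechanism, the check on write accesses to the real-time network can be sketched as follows. The text does not describe how the TSC decides which points in time are legal, so the round-robin schedule of equal-length slots, the class `TimeSliceMonitor`, and its parameters are assumptions made purely for this example. The only behavior taken from the text is that an illegal write raises an NMI and that the NMI stays set until the node is reset.

```python
class TimeSliceMonitor:
    """Hypothetical model of a time slice check with a latched NMI."""

    def __init__(self, slot_len, n_slots, own_slot):
        self.slot_len = slot_len    # duration of one slot (assumed unit: ticks)
        self.n_slots = n_slots      # number of slots in one round (assumption)
        self.own_slot = own_slot    # index of the slot owned by this node
        self.nmi_pending = False    # latched until reset() is called

    def in_own_slot(self, t):
        # Assumed schedule: slots of equal length repeating round-robin.
        return (t // self.slot_len) % self.n_slots == self.own_slot

    def write(self, t):
        """Return True if the write at time t is allowed."""
        if self.nmi_pending or not self.in_own_slot(t):
            # An illegal write raises the NMI; further writes are refused.
            self.nmi_pending = True
            return False
        return True

    def reset(self):
        # Only a node reset clears the NMI.
        self.nmi_pending = False
```

For example, with `slot_len=10`, `n_slots=4`, and `own_slot=1`, a write at time 12 is accepted, whereas a write at time 25 falls into slot 2 and sets the NMI, after which every write is refused until `reset()` is called.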
System Software EDMs. These mechanisms are implemented by the operating system
or by special system tasks. They include:
• Assertions built into the operating system (OS), such as integrity checks on data
  or processing time overflow
• Mechanisms inserted by the compiler (i.e., Compiler Generated Run-Time
  Assertions, CGRTA) to implement concurrent checks, such as value range overflow
  of a variable and loop iteration bound overflow
When an error is detected by any of these mechanisms, a “trap” instruction is
executed, leading to a node reset.
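The two kinds of compiler-inserted checks named above can be illustrated in a few lines. This is not the code produced by the Modula/R compiler; the exception class `Trap` and the helpers `check_range` and `bounded` are hypothetical stand-ins for the “trap” instruction and for the generated assertions.

```python
class Trap(Exception):
    """Stands in for the "trap" instruction that leads to a node reset."""


def check_range(value, low, high):
    # Value range check: trap if the value leaves its declared range.
    if not (low <= value <= high):
        raise Trap(f"value range overflow: {value} not in [{low}, {high}]")
    return value


def bounded(iterable, max_iterations):
    # Loop iteration bound check: trap if the loop body would run
    # more often than the declared bound.
    for count, item in enumerate(iterable, start=1):
        if count > max_iterations:
            raise Trap("loop iteration bound overflow")
        yield item
```

A loop guarded this way, for instance `for x in bounded(samples, 100): total = check_range(total + x, 0, 10_000)`, traps as soon as either the iteration bound or the value range is exceeded, where `samples` and the limits are placeholders chosen for the example.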