Flexicache: Highly Reliable and Low Power Cache under Supply Voltage Scaling - High Performance Computing


- Flexicache allows the cache to operate down to 320 mV (10% failure rate), yielding on average a 63% energy reduction in cache operations. The area overhead of Flexicache is only 12% compared to a typical L1 cache.

2 Background and Related Work

In this section, we first explain the nomenclature of failures in memory structures. Then we present previous schemes used for scaling Vdd.

Memory Failures: Bit failures are classified into two broad categories [12]:

Persistent Failures: Random variation in the number and location of dopant atoms in the channel region of a device leads to random variation in transistor threshold voltage, which causes threshold voltage mismatch between transistors close to each other. In an SRAM cell, a mismatch in strength between neighbouring transistors caused by intra-die variations can result in the failure of the cell [4]. A cell failure can occur due to: (1) an increase in the cell access time, (2) an unstable read operation, (3) an unstable write operation, or (4) a failure in the data-holding capability of the cell. Further details can be found in [30].

On the other hand, open or short circuits cause irreversible physical changes in semiconductor devices. These permanent failures tend to occur early in the processor lifetime due to manufacturing faults (called infant mortality), or late in the lifetime due to thermal and process-related stress. The location of a persistent failure is random and independent of whether the neighbouring bit is faulty or not [20]. The locations of persistently defective bits can be detected by performing a built-in self test (BIST) [17].
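Because persistent failures are described as random and independent of neighbouring bits, the number of faulty bits in a block can be modelled with a binomial distribution. The following is a minimal sketch under that assumption only; the function names, the 512-bit line size, and the per-bit failure probabilities are illustrative choices, not values taken from the text.

```python
from math import comb


def prob_line_has_fault(p_bit: float, bits_per_line: int) -> float:
    """Probability that a line contains at least one faulty bit,
    assuming each bit fails independently with probability p_bit."""
    return 1.0 - (1.0 - p_bit) ** bits_per_line


def prob_at_most_k_faults(p_bit: float, bits_per_line: int, k: int) -> float:
    """Probability that a line contains at most k faulty bits
    (binomial CDF under the same independence assumption)."""
    return sum(
        comb(bits_per_line, i) * p_bit**i * (1.0 - p_bit) ** (bits_per_line - i)
        for i in range(k + 1)
    )


if __name__ == "__main__":
    # Hypothetical per-bit failure probabilities for a 512-bit line.
    for p in (1e-6, 1e-4, 1e-2):
        print(p, prob_line_has_fault(p, 512), prob_at_most_k_faults(p, 512, 1))
```

Under this model, a scheme that tolerates up to k faulty bits per line keeps a line usable with probability `prob_at_most_k_faults(p, n, k)`, which shows why correction strength matters more as the per-bit failure probability rises.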

Non-Persistent Failures: Radiation events or power supply noise can cause a bit flip and corrupt the data stored in a device until new data is written [8]. As transistor dimensions and operating voltages shrink, sensitivity to radiation events increases drastically. In addition, process variation or in-progress wear-out, combined with voltage and temperature fluctuations, might cause correlated faults of short duration. These are termed intermittent faults (or erratic failures) and last from several cycles to several seconds [13]. Diagnosing an intermittent fault by BIST is hard since the fault does not persist and the conditions that cause it are hard to reproduce. As Vdd decreases, the bit failure rate increases rapidly for both intermittent faults and persistent failures [23,12].

Related Work: In this section, we discuss architecture-based schemes used under voltage scaling and compare their main characteristics with Flexicache in Table 1. Orthogonal Latin Square Code (OLSC) [18] is a state-of-the-art ECC scheme used for level-1 caches when the supply voltage is lower than the safe margin. Multi-Bit Segmented ECC (MS-ECC) [12] applies OLSC at a finer granularity in order to increase its error correction capability for use at ultra-low voltage levels. Thus MS-ECC can reduce the supply voltage down to 350 mV in 35nm technology while providing 6.5% useful cache capacity (we define useful cache capacity as the portion of the cache that is not disabled) [23]. Kim et al. [19] propose two-dimensional (2D) ECC to correct multi-bit errors with a minimum area overhead in check bits. However, the correction capability
