writes, the difference is considerably larger, with the SDRAM being at least 10 and as much
as 100 times faster than Flash depending on the circumstances.
The rapid improvements in high-density Flash in the past decade have made the technology a
viable part of memory hierarchies in mobile devices and as solid-state replacements for disks.
As the rate of increase in DRAM density continues to drop, Flash could play an increased role
in future memory systems, acting as both a replacement for hard disks and as an intermediate
storage between DRAM and disk.
Enhancing Dependability in Memory Systems
Large caches and main memories significantly increase the possibility of errors occurring
both during the fabrication process and dynamically, primarily from cosmic rays striking a
memory cell. These dynamic errors, which are changes to a cell's contents, not a change in the
circuitry, are called soft errors. All DRAMs, Flash memory, and many SRAMs are manufactured
with spare rows, so that a small number of manufacturing defects can be accommodated by
programming the replacement of a defective row by a spare row. In addition to fabrication
errors that must be fixed at configuration time, hard errors, which are permanent changes in
the operation of one or more memory cells, can occur in operation.
Dynamic errors can be detected by parity bits and detected and fixed by the use of Error
Correcting Codes (ECCs). Because instruction caches are read-only, parity suffices. In larger
data caches and in main memory, ECC is used to allow errors to be both detected and cor-
rected. Parity requires only one bit of overhead to detect a single error in a sequence of bits.
Because a multibit error would be undetected with parity, the number of bits protected by a
parity bit must be limited. One parity bit per 8 data bits is a typical ratio. ECC can detect two
errors and correct a single error with a cost of 8 bits of overhead per 64 data bits.
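To make the SECDED idea concrete, the following sketch encodes 4 data bits with an (8,4)
extended Hamming code: three check bits locate any single flipped bit, and an overall parity
bit distinguishes a single error from a double error. This is only an illustration; commercial
memory ECC amortizes the check bits over 64 data bits (a (72,64) code, matching the 8-per-64
overhead quoted above), and the function names here are invented for the example.

#include <stdint.h>
#include <stdio.h>

/* Illustrative SECDED sketch using an (8,4) extended Hamming code.
 * Real DRAM ECC protects 64 data bits with 8 check bits, but the
 * encode/decode mechanics are the same.
 * __builtin_parity is a GCC/Clang builtin returning the XOR of all bits. */

static uint8_t secded_encode(uint8_t data)      /* low 4 bits of data */
{
    uint8_t d1 = (data >> 0) & 1, d2 = (data >> 1) & 1;
    uint8_t d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;

    /* Codeword bit i holds Hamming position i; data sits at 3, 5, 6, 7. */
    uint8_t c = (uint8_t)((d1 << 3) | (d2 << 5) | (d3 << 6) | (d4 << 7));

    /* Check bits at positions 1, 2, 4 each cover the positions whose
     * index contains that power of two. */
    c |= (uint8_t)((d1 ^ d2 ^ d4) << 1);        /* covers 3, 5, 7 */
    c |= (uint8_t)((d1 ^ d3 ^ d4) << 2);        /* covers 3, 6, 7 */
    c |= (uint8_t)((d2 ^ d3 ^ d4) << 4);        /* covers 5, 6, 7 */

    /* Overall parity (bit 0) upgrades single-error correction to SECDED. */
    return c | (uint8_t)__builtin_parity(c);
}

/* Returns 0 and writes the (possibly corrected) data on success,
 * or -1 if an uncorrectable double error was detected. */
static int secded_decode(uint8_t c, uint8_t *data)
{
    int s = __builtin_parity(c & 0xAA)          /* checks positions 1,3,5,7 */
          | __builtin_parity(c & 0xCC) << 1     /* checks positions 2,3,6,7 */
          | __builtin_parity(c & 0xF0) << 2;    /* checks positions 4,5,6,7 */
    int overall = __builtin_parity(c);          /* parity over all 8 bits  */

    if (s != 0 && overall != 0)
        c ^= (uint8_t)(1u << s);                /* single error: flip bit s */
    else if (s != 0 && overall == 0)
        return -1;                              /* double error: detect only */
    else if (s == 0 && overall != 0)
        c ^= 1;                                 /* the parity bit itself flipped */

    *data = (uint8_t)(((c >> 3) & 1) | (((c >> 5) & 1) << 1) |
                      (((c >> 6) & 1) << 2) | (((c >> 7) & 1) << 3));
    return 0;
}

int main(void)
{
    uint8_t word = secded_encode(0xB);          /* protect data value 1011b */
    uint8_t fixed;

    secded_decode(word ^ 0x20, &fixed);         /* flip one bit: corrected  */
    printf("single error -> data 0x%X\n", fixed);

    int rc = secded_decode(word ^ 0x22, &fixed);/* flip two bits: detected  */
    printf("double error -> %s\n", rc ? "detected, not correctable" : "?");
    return 0;
}

The same syndrome logic scales up: with 64 data bits, eight check bits are enough to point at
any one of the 72 bit positions and, via the overall parity, to tell one flipped bit from two.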
In very large systems, the possibility of multiple errors as well as complete failure of a single
memory chip becomes significant. Chipkill was introduced by IBM to solve this problem, and
many very large systems, such as IBM and Sun servers and the Google clusters, use this tech-
nology. (Intel calls their version SDDC.) Similar in nature to the RAID approach used for disks,
Chipkill distributes the data and ECC information, so that the complete failure of a single
memory chip can be handled by supporting the reconstruction of the missing data from the
remaining memory chips. An analysis by IBM, assuming a 10,000-processor server with 4 GB
per processor, yields the following rates of unrecoverable errors in three years of
operation:
■ Parity only—about 90,000, or one unrecoverable (or undetected) failure every 17 minutes
■ ECC only—about 3500, or about one undetected or unrecoverable failure every 7.5 hours
■ Chipkill—6, or about one undetected or unrecoverable failure every 2 months
Another way to look at this is to find the maximum number of servers (each with 4 GB) that
can be protected while achieving the same error rate as demonstrated for Chipkill. For par-
ity, even a server with only one processor will have an unrecoverable error rate higher than
a 10,000-server Chipkill-protected system. For ECC, a 17-server system would have about the
same failure rate as a 10,000-server Chipkill system (17 × 3500/10,000 ≈ 6 failures in three
years, matching the six Chipkill failures above). Hence, Chipkill is a requirement for the
50,000 to 100,000 servers in warehouse-scale computers (see Section 6.8 of Chapter 6).
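The RAID-like reconstruction idea behind Chipkill can be sketched in a few lines. The code
below is only an illustration under simplifying assumptions: a 64-bit word is striped one byte
per chip across eight data chips, and a ninth chip stores the XOR of those bytes, so a completely
failed chip's contribution can be rebuilt from the survivors. Production Chipkill/SDDC designs
use stronger symbol-based codes and different chip organizations; the chip counts and function
names here are invented for the example.

#include <stdint.h>
#include <stdio.h>

/* Illustrative Chipkill-style sketch: a 64-bit word is striped one byte
 * per chip across DATA_CHIPS chips, and one extra chip stores the XOR of
 * those bytes (RAID-like parity). */

#define DATA_CHIPS 8

/* Compute the parity chip's byte for one word. */
static uint8_t parity_byte(const uint8_t chip_bytes[DATA_CHIPS])
{
    uint8_t p = 0;
    for (int i = 0; i < DATA_CHIPS; i++)
        p ^= chip_bytes[i];
    return p;
}

/* Rebuild the byte a failed chip contributed to one word, using the
 * surviving data chips plus the parity chip. */
static uint8_t rebuild(const uint8_t chip_bytes[DATA_CHIPS],
                       uint8_t parity, int failed_chip)
{
    uint8_t b = parity;
    for (int i = 0; i < DATA_CHIPS; i++)
        if (i != failed_chip)
            b ^= chip_bytes[i];
    return b;
}

int main(void)
{
    /* One 64-bit word, split one byte per chip. */
    uint64_t word = 0x0123456789ABCDEFULL;
    uint8_t chips[DATA_CHIPS];
    for (int i = 0; i < DATA_CHIPS; i++)
        chips[i] = (uint8_t)(word >> (8 * i));

    uint8_t parity = parity_byte(chips);    /* stored on the extra chip */

    /* Chip 3 dies: recover its byte from the remaining chips. */
    uint8_t recovered = rebuild(chips, parity, 3);
    printf("chip 3 held 0x%02X, rebuilt 0x%02X\n", chips[3], recovered);
    return 0;
}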