writes, the difference is considerably larger, with the SDRAM being at least 10 and as much
as 100 times faster than Flash depending on the circumstances.
The rapid improvements in high-density Flash in the past decade have made the technology a
viable part of memory hierarchies in mobile devices and as solid-state replacements for disks.
As the rate of increase in DRAM density continues to drop, Flash could play an increased role
in future memory systems, acting as both a replacement for hard disks and as an intermediate
storage between DRAM and disk.
Enhancing Dependability in Memory Systems
Large caches and main memories significantly increase the possibility of errors occurring
both during the fabrication process and dynamically, primarily from cosmic rays striking a
memory cell. These dynamic errors, which are changes to a cell's contents, not a change in the
circuitry, are called soft errors. All DRAMs, Flash memory, and many SRAMs are manufactured
with spare rows, so that a small number of manufacturing defects can be accommodated by
programming the replacement of a defective row by a spare row. In addition to fabrication
errors that must be fixed at configuration time, hard errors, which are permanent changes in
the operation of one or more memory cells, can occur in operation.
Dynamic errors can be detected by parity bits and detected and fixed by the use of Error
Correcting Codes (ECCs). Because instruction caches are read-only, parity suffices. In larger
data caches and in main memory, ECC is used to allow errors to be both detected and cor-
rected. Parity requires only one bit of overhead to detect a single error in a sequence of bits.
Because a multibit error would be undetected with parity, the number of bits protected by a
parity bit must be limited. One parity bit per 8 data bits is a typical ratio. ECC can detect two
errors and correct a single error with a cost of 8 bits of overhead per 64 data bits.
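To make the SECDED idea concrete, the following sketch encodes 4 data bits with an (8,4)
extended Hamming code: three check bits locate any single flipped bit, and an overall parity
bit distinguishes a single error from a double error. This is only an illustration; commercial
memory ECC amortizes the check bits over 64 data bits (a (72,64) code, matching the 8-per-64
overhead quoted above), and the function names here are invented for the example.

#include <stdint.h>
#include <stdio.h>

/* Illustrative SECDED sketch using an (8,4) extended Hamming code.
 * Real DRAM ECC protects 64 data bits with 8 check bits, but the
 * encode/decode mechanics are the same.
 * __builtin_parity is a GCC/Clang builtin returning the XOR of all bits. */

static uint8_t secded_encode(uint8_t data)      /* low 4 bits of data */
{
    uint8_t d1 = (data >> 0) & 1, d2 = (data >> 1) & 1;
    uint8_t d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;

    /* Codeword bit i holds Hamming position i; data sits at 3, 5, 6, 7. */
    uint8_t c = (uint8_t)((d1 << 3) | (d2 << 5) | (d3 << 6) | (d4 << 7));

    /* Check bits at positions 1, 2, 4 each cover the positions whose
     * index contains that power of two. */
    c |= (uint8_t)((d1 ^ d2 ^ d4) << 1);        /* covers 3, 5, 7 */
    c |= (uint8_t)((d1 ^ d3 ^ d4) << 2);        /* covers 3, 6, 7 */
    c |= (uint8_t)((d2 ^ d3 ^ d4) << 4);        /* covers 5, 6, 7 */

    /* Overall parity (bit 0) upgrades single-error correction to SECDED. */
    return c | (uint8_t)__builtin_parity(c);
}

/* Returns 0 and writes the (possibly corrected) data on success,
 * or -1 if an uncorrectable double error was detected. */
static int secded_decode(uint8_t c, uint8_t *data)
{
    int s = __builtin_parity(c & 0xAA)          /* checks positions 1,3,5,7 */
          | __builtin_parity(c & 0xCC) << 1     /* checks positions 2,3,6,7 */
          | __builtin_parity(c & 0xF0) << 2;    /* checks positions 4,5,6,7 */
    int overall = __builtin_parity(c);          /* parity over all 8 bits  */

    if (s != 0 && overall != 0)
        c ^= (uint8_t)(1u << s);                /* single error: flip bit s */
    else if (s != 0 && overall == 0)
        return -1;                              /* double error: detect only */
    else if (s == 0 && overall != 0)
        c ^= 1;                                 /* the parity bit itself flipped */

    *data = (uint8_t)(((c >> 3) & 1) | (((c >> 5) & 1) << 1) |
                      (((c >> 6) & 1) << 2) | (((c >> 7) & 1) << 3));
    return 0;
}

int main(void)
{
    uint8_t word = secded_encode(0xB);          /* protect data value 1011b */
    uint8_t fixed;

    secded_decode(word ^ 0x20, &fixed);         /* flip one bit: corrected  */
    printf("single error -> data 0x%X\n", fixed);

    int rc = secded_decode(word ^ 0x22, &fixed);/* flip two bits: detected  */
    printf("double error -> %s\n", rc ? "detected, not correctable" : "?");
    return 0;
}

The same syndrome logic scales up: with 64 data bits, eight check bits are enough to point at
any one of the 72 bit positions and, via the overall parity, to tell one flipped bit from two.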
In very large systems, the possibility of multiple errors as well as complete failure of a single
memory chip becomes significant. Chipkill was introduced by IBM to solve this problem, and
many very large systems, such as IBM and Sun servers and the Google clusters, use this tech-
nology. (Intel calls their version SDDC.) Similar in nature to the RAID approach used for disks,
Chipkill distributes the data and ECC information, so that the complete failure of a single
memory chip can be handled by supporting the reconstruction of the missing data from the
remaining memory chips. An analysis by IBM, assuming a 10,000-processor server with 4 GB
per processor, yields the following rates of unrecoverable errors in three years of
operation:
■ Parity only—about 90,000, or one unrecoverable (or undetected) failure every 17 minutes
■ ECC only—about 3500, or about one undetected or unrecoverable failure every 7.5 hours
■ Chipkill—6, or about one undetected or unrecoverable failure every 2 months
Another way to look at this is to find the maximum number of servers (each with 4 GB) that
can be protected while achieving the same error rate as demonstrated for Chipkill. For par-
ity, even a server with only one processor will have an unrecoverable error rate higher than
a 10,000-server Chipkill-protected system. For ECC, a 17-server system would have about the
same failure rate as a 10,000-server Chipkill system (17 × 3500/10,000 ≈ 6 failures in three
years, matching the six Chipkill failures above). Hence, Chipkill is a requirement for the
50,000 to 100,000 servers in warehouse-scale computers (see Section 6.8 of Chapter 6).
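The RAID-like reconstruction idea behind Chipkill can be sketched in a few lines. The code
below is only an illustration under simplifying assumptions: a 64-bit word is striped one byte
per chip across eight data chips, and a ninth chip stores the XOR of those bytes, so a completely
failed chip's contribution can be rebuilt from the survivors. Production Chipkill/SDDC designs
use stronger symbol-based codes and different chip organizations; the chip counts and function
names here are invented for the example.

#include <stdint.h>
#include <stdio.h>

/* Illustrative Chipkill-style sketch: a 64-bit word is striped one byte
 * per chip across DATA_CHIPS chips, and one extra chip stores the XOR of
 * those bytes (RAID-like parity). */

#define DATA_CHIPS 8

/* Compute the parity chip's byte for one word. */
static uint8_t parity_byte(const uint8_t chip_bytes[DATA_CHIPS])
{
    uint8_t p = 0;
    for (int i = 0; i < DATA_CHIPS; i++)
        p ^= chip_bytes[i];
    return p;
}

/* Rebuild the byte a failed chip contributed to one word, using the
 * surviving data chips plus the parity chip. */
static uint8_t rebuild(const uint8_t chip_bytes[DATA_CHIPS],
                       uint8_t parity, int failed_chip)
{
    uint8_t b = parity;
    for (int i = 0; i < DATA_CHIPS; i++)
        if (i != failed_chip)
            b ^= chip_bytes[i];
    return b;
}

int main(void)
{
    /* One 64-bit word, split one byte per chip. */
    uint64_t word = 0x0123456789ABCDEFULL;
    uint8_t chips[DATA_CHIPS];
    for (int i = 0; i < DATA_CHIPS; i++)
        chips[i] = (uint8_t)(word >> (8 * i));

    uint8_t parity = parity_byte(chips);    /* stored on the extra chip */

    /* Chip 3 dies: recover its byte from the remaining chips. */
    uint8_t recovered = rebuild(chips, parity, 3);
    printf("chip 3 held 0x%02X, rebuilt 0x%02X\n", chips[3], recovered);
    return 0;
}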