Digital Signal Processing Reference
processor die, which can directly increase the system parallelism and potentially
improve the overall computing system performance without increasing the chip
footprint. In this context, the 3D DRAM has a heterogeneous structure and covers
two levels of the entire memory hierarchy. Because the L2 cache demands very short
access latency, the 3D DRAM L2 cache must be specifically customized to achieve
an access latency comparable to that of its on-chip SRAM counterpart.
5.2 3D DRAM L2 Cache
The 3D VLIW architecture configuration shown in Fig. 3b migrates the on-chip
L2 cache into the 3D DRAM domain. Since L2 cache access latency plays a critical
role in determining overall computing system performance, one may intuitively
argue that, compared with an on-chip SRAM L2 cache, a 3D DRAM L2 cache
suffers from much longer access latency and hence causes significant performance
degradation. In this section, we show that this intuitive argument does not
necessarily hold. In particular, as we increase the L2 cache capacity and the
number of DRAM dies, the 3D DRAM L2 cache can achieve an access latency
comparable to, or even shorter than, that of an SRAM L2 cache.
Commercial DRAM is typically much slower than SRAM mainly because, as a
commodity product, DRAM has always been optimized for density and cost rather
than speed. DRAM speed can be greatly improved, at the expense of density and
fabrication cost, by two approaches:
1. We can reduce the size of each individual DRAM sub-array to reduce the memory
access latency, at a penalty in storage density. With shorter word-lines and
bit-lines, a smaller DRAM sub-array directly reduces access latency because of
the lighter load presented to the peripheral circuits.
2. We can adopt the multiple threshold voltage (multi-Vth) technique that has been
widely used in logic circuit design [30], i.e., we still use high-Vth transistors in
the DRAM cells to keep the cell leakage current sufficiently low, while using
low-Vth transistors in the peripheral circuits and H-tree buffers to reduce
latency. Such a multi-Vth design is not typically used in commodity DRAM, since
it increases the leakage power consumption of the peripheral circuits and, more
importantly, complicates the DRAM fabrication process and hence incurs a
higher cost.
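The first trade-off above can be illustrated with a toy distributed-RC (Elmore) delay model: word-line and bit-line delay grows roughly quadratically with line length, so a smaller sub-array is faster at the cost of more peripheral-circuit area per bit. The per-cell resistance, capacitance, and sensing-time constants below are illustrative assumptions, not values from this chapter.

```python
def subarray_latency_ns(cells_per_line, r_per_cell=10.0, c_per_cell=0.5e-15,
                        sense_ns=2.0):
    """Rough access-latency estimate for a square DRAM sub-array.

    Distributed-RC (Elmore) delay of a line with n cells scales as
    ~ (n*r) * (n*c) / 2, i.e. quadratically in line length.
    All constants are assumed for illustration.
    """
    rc_seconds = 0.5 * (cells_per_line * r_per_cell) * (cells_per_line * c_per_cell)
    line_ns = rc_seconds * 1e9
    # one word-line traversal + one bit-line traversal + sensing
    return 2 * line_ns + sense_ns

for n in (256, 512, 1024):
    print(f"{n}x{n} sub-array: ~{subarray_latency_ns(n):.2f} ns")
```

Under this model, quadrupling the sub-array dimension multiplies the wire-delay component by roughly sixteen, which is why shrinking sub-arrays is such an effective (if density-hungry) latency lever.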
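The second trade-off can be sketched with two standard first-order device relations: the alpha-power law for gate delay and the exponential dependence of subthreshold leakage on Vth. The supply voltage, alpha exponent, and subthreshold-slope constants below are assumptions chosen only to show the direction and rough magnitude of the effect.

```python
import math

VDD = 1.2           # assumed supply voltage (V)
ALPHA = 1.3         # assumed velocity-saturation exponent
N_VT = 1.5 * 0.026  # assumed subthreshold slope factor * thermal voltage (V)

def rel_delay(vth):
    """Relative gate delay ~ VDD / (VDD - Vth)^alpha (alpha-power law)."""
    return VDD / (VDD - vth) ** ALPHA

def rel_leakage(vth):
    """Relative subthreshold leakage ~ exp(-Vth / (n * vT))."""
    return math.exp(-vth / N_VT)

# compare a low-Vth peripheral transistor (0.3 V) against a high-Vth one (0.5 V)
for vth in (0.5, 0.3):
    print(f"Vth={vth} V: delay x{rel_delay(vth) / rel_delay(0.5):.2f}, "
          f"leakage x{rel_leakage(vth) / rel_leakage(0.5):.0f}")
```

Lowering Vth in the peripheral circuits buys a modest speedup while leakage grows by orders of magnitude, which matches the chapter's point: acceptable in latency-critical cache peripherals, but unattractive for commodity DRAM.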
Moreover, as we increase the L2 cache capacity, global routing plays an
increasingly large role in the overall L2 cache access latency. The 3D DRAM
design strategy presented above directly reduces the latency incurred by global
routing, which further helps reduce the 3D DRAM L2 cache access latency relative
to a 2D SRAM L2 cache.
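The global-routing effect can be sketched geometrically: a cache of fixed total area spread over d stacked dies has a per-die footprint that shrinks by a factor of d, so a buffered H-tree whose delay grows roughly linearly with wire length shortens accordingly, at the cost of a short vertical (TSV) hop per die crossing. The area-per-MB, wire-delay, and TSV-delay constants below are assumptions for illustration, not figures from this chapter.

```python
import math

def routing_latency_ns(capacity_mb, dies, ns_per_mm=0.1, tsv_ns=0.05,
                       mm2_per_mb=4.0):
    """Toy model of global (H-tree) routing delay for a stacked cache.

    Assumptions: repeated wires give delay linear in length; the H-tree
    trunk length scales with the per-die footprint's side; each extra
    die adds one short TSV hop. All constants are illustrative.
    """
    area_mm2 = capacity_mb * mm2_per_mb / dies   # footprint per die
    wire_mm = 2 * math.sqrt(area_mm2)            # ~H-tree traversal length
    return wire_mm * ns_per_mm + (dies - 1) * tsv_ns

for d in (1, 2, 4):
    print(f"{d} die(s), 8 MB cache: ~{routing_latency_ns(8, d):.2f} ns")
```

Because TSV hops are far cheaper than millimeters of planar wire under these assumptions, the routing delay falls as dies are added, and the advantage grows with cache capacity.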
To evaluate the above arguments, Fig. 4 compares the access latency of 2D SRAM,
single-Vth 2D DRAM, and multi-Vth 2D DRAM under different L2 cache capacities
at the 65 nm node. The results show that, as we increase the capacity of