FIGURE 4.17: Physical cache organization (predecoder; four 512KB subarrays, each holding ways Way0-Way3; global and local segmented wordlines; precharge, column MUX, and sense amplifiers). Adapted from [21].
The physical cache hosts a virtual two-level hierarchy. Virtual L1 and L2 caches are
created within the physical cache by assigning ways to each level. Table 4.8 shows the possible
assignments, along with the resulting size, associativity, and access time (in cycles) of the
virtual L1. An important difference from Albonesi's first proposal, which advocated changing
the clock frequency to suit a faster or slower L1 [7], is that here the clock frequency remains fixed.
What changes is the access latency, in cycles, for both the L1 and the L2. Latency changes in
half-cycle increments, assuming that data can be captured using both phases of the clock as in
the Alpha 21264.
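The way-partitioning arithmetic can be sketched as follows. The concrete sizes and latencies are those of Table 4.8; the numbers below (512KB per way, a 2-cycle base latency, and half-cycle growth per added way) are illustrative assumptions, not the table's actual values.

```python
# Hypothetical sketch of virtual L1 configurations carved out of a
# 4-way physical cache. The constants are assumptions for illustration;
# the real size/associativity/latency combinations appear in Table 4.8.
WAY_SIZE_KB = 512      # assumed capacity contributed by each way
BASE_LATENCY = 2.0     # assumed access time of the smallest L1, in cycles

def virtual_l1(ways_assigned):
    """Return (size_kb, associativity, latency_cycles) for a virtual L1
    built from `ways_assigned` of the four physical ways."""
    size_kb = ways_assigned * WAY_SIZE_KB
    # Latency changes in half-cycle increments, since data can be
    # captured on both phases of the clock (as in the Alpha 21264).
    latency = BASE_LATENCY + 0.5 * (ways_assigned - 1)
    return size_kb, ways_assigned, latency

for w in range(1, 5):
    size, assoc, lat = virtual_l1(w)
    print(f"{w} way(s): {size} KB, {assoc}-way, {lat} cycles")
```

The remaining ways in each configuration form the virtual L2, so growing the L1 shrinks the L2 by the same amount.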
Similarly to the first proposal [ 7 ], the virtual caches are exclusive. On each access, one of
the subarrays is enabled by predecoding a Subarray Select field in the requested address. Within
the enabled subarray, only the L1 section (the shaded ways in Table 4.8) is initially accessed.
In case of a miss, the L2 section is then accessed. If there is a hit in the L2, the requested data
are moved to the L1 by swapping places with the data already read during the L1 miss. If there
is a miss in the L2, data returning from memory are placed in the L1 section; any displaced L1 data
are moved into the L2 section.
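The lookup-and-swap protocol above can be sketched in a few lines. This is an illustrative model, not the paper's hardware: Python lists stand in for the L1 and L2 sections of one subarray, block addresses are plain integers, and a simple FIFO stands in for the real replacement policy.

```python
# Minimal sketch of the exclusive virtual L1/L2 access protocol:
# L1 hit -> done; L2 hit -> swap with the L1 block read during the miss;
# miss in both -> fill L1 from memory, displacing an L1 block into L2.
class VirtualHierarchy:
    def __init__(self, l1_capacity, l2_capacity):
        self.l1 = []                 # L1 section (front = next victim)
        self.l2 = []                 # L2 section; exclusive of L1
        self.l1_cap = l1_capacity
        self.l2_cap = l2_capacity

    def access(self, block):
        if block in self.l1:
            return "L1 hit"
        if block in self.l2:
            # L2 hit: swap places with an L1 block, preserving exclusion.
            self.l2.remove(block)
            victim = self.l1.pop(0) if len(self.l1) >= self.l1_cap else None
            self.l1.append(block)
            if victim is not None:
                self.l2.append(victim)
            return "L2 hit"
        # Miss in both: data returning from memory go into the L1 section;
        # any displaced L1 block moves into the L2 section.
        if len(self.l1) >= self.l1_cap:
            self.l2.append(self.l1.pop(0))
            if len(self.l2) > self.l2_cap:
                self.l2.pop(0)       # evicted from L2 back to memory
        self.l1.append(block)
        return "miss"
```

Note that a block is never resident in both sections at once, which is what makes the hierarchy exclusive.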
Feedback control: configuration searches: The justification behind this L1/L2 partitioning is
that it can adjust to different tolerances for hit and miss latencies. For programs, or better yet
program phases, that have a very low tolerance for hit latency, a fast L1 can be employed even if
it does not yield a very high hit rate. On the other hand, if a program (or program phase) can
tolerate somewhat higher hit latency but cannot tolerate a large miss latency, then a larger L1
(albeit somewhat slower) might be the right solution.
The goal is therefore to find a configuration of the virtual caches that yields the right
balance between hit latency and miss rate, per program phase. Balasubramonian et al. propose
a method to achieve this balance but leave open the choice for a software or a hardware
implementation. Their method works as follows. Performance statistics (miss rate, IPC, and