FIGURE 4.17: Physical cache organization (predecoder; four 512KB subarrays, each holding ways Way0-Way3; global and local segmented wordlines; precharge, column MUX, and sense amplifiers). Adapted from [21].
The physical cache hosts a virtual two-level hierarchy. Virtual L1 and L2 caches are
created within the physical cache by assigning ways to each level. Table 4.8 shows the possible
assignments, along with the resulting size, associativity, and access time (in cycles) of the
virtual L1. An important difference from Albonesi's first proposal, which advocated changing
the clock frequency to suit a faster or slower L1 [7], is that here the clock frequency remains fixed.
What changes is the access latency, in cycles, for both the L1 and the L2. Latency changes in
half-cycle increments, assuming that data can be captured using both phases of the clock as in
the Alpha 21264.
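The way-partitioning arithmetic can be sketched as follows. The concrete sizes and latencies are those of Table 4.8; the numbers below (512KB per way, a 2-cycle base latency, and half-cycle growth per added way) are illustrative assumptions, not the table's actual values.

```python
# Hypothetical sketch of virtual L1 configurations carved out of a
# 4-way physical cache. The constants are assumptions for illustration;
# the real size/associativity/latency combinations appear in Table 4.8.
WAY_SIZE_KB = 512      # assumed capacity contributed by each way
BASE_LATENCY = 2.0     # assumed access time of the smallest L1, in cycles

def virtual_l1(ways_assigned):
    """Return (size_kb, associativity, latency_cycles) for a virtual L1
    built from `ways_assigned` of the four physical ways."""
    size_kb = ways_assigned * WAY_SIZE_KB
    # Latency changes in half-cycle increments, since data can be
    # captured on both phases of the clock (as in the Alpha 21264).
    latency = BASE_LATENCY + 0.5 * (ways_assigned - 1)
    return size_kb, ways_assigned, latency

for w in range(1, 5):
    size, assoc, lat = virtual_l1(w)
    print(f"{w} way(s): {size} KB, {assoc}-way, {lat} cycles")
```

The remaining ways in each configuration form the virtual L2, so growing the L1 shrinks the L2 by the same amount.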
Similarly to the first proposal [ 7 ], the virtual caches are exclusive. On each access, one of
the subarrays is enabled by predecoding a Subarray Select field in the requested address. Within
the enabled subarray, only the L1 section (the shaded ways in Table 4.8) is initially accessed.
In case of a miss, the L2 section is then accessed. If there is a hit in the L2, the requested data
are moved to the L1 by swapping places with the data already read during the L1 miss. If there
is a miss in the L2, data returning from memory are placed in the L1 section; any displaced L1 data
are moved into the L2 section.
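The lookup-and-swap protocol above can be sketched in a few lines. This is an illustrative model, not the paper's hardware: Python lists stand in for the L1 and L2 sections of one subarray, block addresses are plain integers, and a simple FIFO stands in for the real replacement policy.

```python
# Minimal sketch of the exclusive virtual L1/L2 access protocol:
# L1 hit -> done; L2 hit -> swap with the L1 block read during the miss;
# miss in both -> fill L1 from memory, displacing an L1 block into L2.
class VirtualHierarchy:
    def __init__(self, l1_capacity, l2_capacity):
        self.l1 = []                 # L1 section (front = next victim)
        self.l2 = []                 # L2 section; exclusive of L1
        self.l1_cap = l1_capacity
        self.l2_cap = l2_capacity

    def access(self, block):
        if block in self.l1:
            return "L1 hit"
        if block in self.l2:
            # L2 hit: swap places with an L1 block, preserving exclusion.
            self.l2.remove(block)
            victim = self.l1.pop(0) if len(self.l1) >= self.l1_cap else None
            self.l1.append(block)
            if victim is not None:
                self.l2.append(victim)
            return "L2 hit"
        # Miss in both: data returning from memory go into the L1 section;
        # any displaced L1 block moves into the L2 section.
        if len(self.l1) >= self.l1_cap:
            self.l2.append(self.l1.pop(0))
            if len(self.l2) > self.l2_cap:
                self.l2.pop(0)       # evicted from L2 back to memory
        self.l1.append(block)
        return "miss"
```

Note that a block is never resident in both sections at once, which is what makes the hierarchy exclusive.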
Feedback control: configuration searches: The justification behind this L1/L2 partitioning is
that it can adjust to different tolerances for hit and miss latencies. For programs, or better yet
program phases, that have a very low tolerance for hit latency, a fast L1 can be employed even if
it does not yield a very high hit rate. On the other hand, if a program (or program phase) can
tolerate somewhat higher hit latency but cannot tolerate a large miss latency, then a larger L1
(albeit somewhat slower) might be the right solution.
The goal is therefore to find a configuration of the virtual caches that yields the right
balance between hit latency and miss rate, per program phase. Balasubramonian et al. propose
a method to achieve this balance but leave open the choice for a software or a hardware
implementation. Their method works as follows. Performance statistics (miss rate, IPC, and