Designing a successful filter cache is a matter of thoroughly exploring the design space to find the points whose performance loss is acceptable given the power benefits. Kin et al. use their own power models for the cache, with parameters for an older 180 nm, 3.3 V technology [142]. For MediaBench workloads they observe that, at very small filter cache sizes, the hit-rate gain of a fully-associative organization over a direct-mapped one is not enough to offset its increased power consumption. Thus, for their setup, a fully-associative filter cache is not a good idea. The best results are reported with 128-byte to 256-byte direct-mapped filter caches. Taking this work further, one could systematically size the entire memory hierarchy to minimize the energy-delay product (EDP) for specific workloads.
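This kind of exploration can be sketched in a few lines. The following Python fragment compares hypothetical filter cache design points by EDP; all numbers (per-access energies, hit rates, latencies) are made-up placeholders, not the values from Kin et al., and serve only to illustrate the comparison.

```python
# Sketch of filter-cache design-space exploration by energy-delay product.
# All constants below are illustrative assumptions, not measured data.

def edp(hit_rate, e_filter, e_l1, t_filter, t_miss_penalty, accesses):
    """Energy-delay product for a filter cache backed by an L1."""
    miss_rate = 1.0 - hit_rate
    # Every access probes the filter cache; misses also pay for the L1.
    energy = accesses * (e_filter + miss_rate * e_l1)
    # Misses add the extra latency of going to the L1.
    delay = accesses * (t_filter + miss_rate * t_miss_penalty)
    return energy * delay

# Hypothetical design points: (size in bytes, hit rate, per-access energy in pJ).
design_points = [
    (128, 0.75, 5.0),
    (256, 0.82, 8.0),
    (512, 0.87, 16.0),
]

E_L1 = 100.0                      # assumed L1 access energy (pJ)
T_FILTER, T_PENALTY = 1.0, 1.0    # assumed cycle counts
N = 1_000_000                     # number of accesses

best = min(design_points,
           key=lambda p: edp(p[1], p[2], E_L1, T_FILTER, T_PENALTY, N))
print("lowest-EDP filter cache size:", best[0], "bytes")  # -> 256 bytes
```

With these particular (invented) numbers the 256-byte point wins: the larger 512-byte cache hits more often but its higher per-access energy outweighs the saved L1 accesses, mirroring the trade-off described above.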
4.10.3 Loop Cache
The counterpart of the filter cache, but for instructions, is the loop cache or loop buffer. The loop cache is designed to hold the small loops commonly found in media and DSP workloads [10, 150, 24]. In contrast to the filter cache, which is a full-fledged (albeit tiny) cache, the loop cache (or, more accurately, loop buffer) is typically just a piece of SRAM under software or compiler control; a canonical example is found in Lucent's DSP16000 core [10].
A small loop is loaded into the loop buffer under program control, and execution then fetches instructions from the loop buffer rather than from the usual fetch path (which might include an instruction L1) until the loop finishes. Being a tiny piece of RAM, the loop buffer supplies instructions very efficiently, avoiding accesses to the far more power-hungry instruction L1. Because the loop buffer holds a small block of consecutive instructions, no tags and no tag comparisons are needed to address its contents. Instead, addressing relative to the start of the loop suffices to generate an index that correctly reaches every loop instruction in the buffer. The lack of tags and tag comparisons makes the loop buffer far more efficient than a typical cache, even one of the same size.
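The tagless, relative-addressing scheme described above can be illustrated with a minimal sketch. The class and instruction strings below are invented for illustration; the point is that a fetch is a single RAM read indexed by `pc - loop_start`, with no tag comparison anywhere.

```python
# Minimal sketch of a software-controlled, tagless loop buffer.
# Names, sizes, and the instruction format are illustrative assumptions.

class LoopBuffer:
    def __init__(self, size_words=32):
        self.mem = [None] * size_words   # plain RAM: no tags, no comparators
        self.loop_start = None           # address of the loop's first instruction

    def load(self, loop_start, instructions):
        """Program-controlled fill: copy a small loop body into the buffer."""
        assert len(instructions) <= len(self.mem), "loop too large for buffer"
        self.loop_start = loop_start
        for i, insn in enumerate(instructions):
            self.mem[i] = insn

    def fetch(self, pc):
        # Addressing relative to the loop start is the entire lookup:
        # unlike a cache, no tag is stored and no comparison is made.
        return self.mem[pc - self.loop_start]

buf = LoopBuffer()
buf.load(0x400, ["ld r1,(r2)", "add r3,r3,r1", "bne r2,r4,0x400"])
print(buf.fetch(0x401))   # -> add r3,r3,r1
```

Every fetch inside the loop thus costs one small-RAM read and one subtraction, which is where the energy advantage over a tagged cache of the same capacity comes from.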
There are also proposals for fully-automatic loop caches that detect small loops at run time and install them in the loop cache dynamically [150, 110, 25, 232]. However, although such dynamic proposals enhance the generality of the loop cache at the expense of additional hardware, they are not critical for the DSP and embedded world, where loop buffers have been successfully deployed. This is because in a controlled software environment the most efficient solution is usually preferable for cost reasons.
In contrast, a fully-automatic loop buffer appears in Intel's Core 2 architecture [110]. Intel embeds the loop buffer in the Instruction Queue (IQ). A hardware loop-detection mechanism, called the Loop Stream Detector (LSD), detects small loops that already fit inside the 18-entry instruction queue. Once a loop is detected, instructions for subsequent loop iterations are streamed from the IQ without any external fetching, until a misprediction on the loop branch is detected. This not only speeds up instruction fetch but at the same time saves considerable energy by not performing external fetches for the streamed iterations.
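The detection step can be sketched as follows. This is an illustrative model, not Intel's actual LSD logic: a loop is assumed to be streamable when a taken backward branch targets an instruction that is still resident in the queue, and the queue entries are represented as simple dictionaries.

```python
# Illustrative sketch of loop-stream detection in an instruction queue.
# The queue model, entry format, and detection rule are assumptions made
# for exposition; they are not Intel's implementation.

IQ_DEPTH = 18  # depth of the instruction queue, as described in the text

def detect_loop(queue):
    """Return (start, end) indices of a streamable loop, or None.

    A loop qualifies when a taken backward branch in the queue targets
    an instruction that is still resident earlier in the same queue, so
    iterations can be replayed without any external fetching.
    """
    for i, insn in enumerate(queue):
        if insn.get("taken_backward_branch"):
            target = insn["target"]
            # The branch target must map to an instruction still in the queue.
            for j in range(i + 1):
                if queue[j]["addr"] == target:
                    return (j, i)   # stream queue[j..i] for later iterations
    return None

# A six-instruction window whose last entry branches back to the third.
queue = [{"addr": a} for a in range(0x100, 0x100 + 6)]
queue[5]["taken_backward_branch"] = True
queue[5]["target"] = 0x102
print(detect_loop(queue))   # -> (2, 5)
```

Once such a window is found, the front end can replay entries `j..i` directly from the queue, which is exactly what makes the energy saving possible: the instruction cache and the rest of the fetch path sit idle while the loop iterates.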