Designing a successful filter cache is a matter of thoroughly exploring the design space to find the points whose performance loss is acceptable given the power benefits. Kin et al. use their own power models for the cache, with parameters for an older 180 nm, 3.3 V technology [142]. For MediaBench workloads they observe that, at very small filter cache sizes, the hit-rate gain of a fully-associative organization over a direct-mapped one is not enough to offset its increased power consumption. Thus, for their setup, a fully-associative filter cache is not a good idea. The best results are reported with 128-byte to 256-byte direct-mapped filter caches. Taking this work further, one could systematically size the entire memory hierarchy to minimize the energy-delay product (EDP) for specific workloads.
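This kind of exploration can be sketched in a few lines. The following Python fragment compares hypothetical filter cache design points by EDP; all numbers (per-access energies, hit rates, latencies) are made-up placeholders, not the values from Kin et al., and serve only to illustrate the comparison.

```python
# Sketch of filter-cache design-space exploration by energy-delay product.
# All constants below are illustrative assumptions, not measured data.

def edp(hit_rate, e_filter, e_l1, t_filter, t_miss_penalty, accesses):
    """Energy-delay product for a filter cache backed by an L1."""
    miss_rate = 1.0 - hit_rate
    # Every access probes the filter cache; misses also pay for the L1.
    energy = accesses * (e_filter + miss_rate * e_l1)
    # Misses add the extra latency of going to the L1.
    delay = accesses * (t_filter + miss_rate * t_miss_penalty)
    return energy * delay

# Hypothetical design points: (size in bytes, hit rate, per-access energy in pJ).
design_points = [
    (128, 0.75, 5.0),
    (256, 0.82, 8.0),
    (512, 0.87, 16.0),
]

E_L1 = 100.0                      # assumed L1 access energy (pJ)
T_FILTER, T_PENALTY = 1.0, 1.0    # assumed cycle counts
N = 1_000_000                     # number of accesses

best = min(design_points,
           key=lambda p: edp(p[1], p[2], E_L1, T_FILTER, T_PENALTY, N))
print("lowest-EDP filter cache size:", best[0], "bytes")  # -> 256 bytes
```

With these particular (invented) numbers the 256-byte point wins: the larger 512-byte cache hits more often but its higher per-access energy outweighs the saved L1 accesses, mirroring the trade-off described above.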
4.10.3 Loop Cache
The counterpart of the filter cache, but for instructions, is the loop cache or loop buffer. The loop cache is designed to hold the small loops commonly found in media and DSP workloads [10, 150, 24]. In contrast to the filter cache, which is a full-fledged (albeit tiny) cache, the loop cache (or, more accurately, loop buffer) is typically just a piece of SRAM under software or compiler control; a canonical example is found in Lucent's DSP16000 core [10].
A small loop is loaded into the loop buffer under program control, and execution then fetches instructions from the loop buffer rather than from the usual fetch path (which might include an instruction L1) until the loop finishes. Being a tiny piece of RAM, the loop buffer supplies instructions very efficiently, avoiding accesses to the far more power-hungry instruction L1. Because the loop buffer holds a small block of consecutive instructions, no tags and no tag comparisons are needed to address its contents. Instead, addressing relative to the start of the loop suffices to generate an index that correctly reaches every loop instruction in the buffer. The lack of tags and tag comparisons makes the loop buffer far more efficient than a typical cache, even one of the same size.
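The tagless, relative-addressing scheme described above can be illustrated with a minimal sketch. The class and instruction strings below are invented for illustration; the point is that a fetch is a single RAM read indexed by `pc - loop_start`, with no tag comparison anywhere.

```python
# Minimal sketch of a software-controlled, tagless loop buffer.
# Names, sizes, and the instruction format are illustrative assumptions.

class LoopBuffer:
    def __init__(self, size_words=32):
        self.mem = [None] * size_words   # plain RAM: no tags, no comparators
        self.loop_start = None           # address of the loop's first instruction

    def load(self, loop_start, instructions):
        """Program-controlled fill: copy a small loop body into the buffer."""
        assert len(instructions) <= len(self.mem), "loop too large for buffer"
        self.loop_start = loop_start
        for i, insn in enumerate(instructions):
            self.mem[i] = insn

    def fetch(self, pc):
        # Addressing relative to the loop start is the entire lookup:
        # unlike a cache, no tag is stored and no comparison is made.
        return self.mem[pc - self.loop_start]

buf = LoopBuffer()
buf.load(0x400, ["ld r1,(r2)", "add r3,r3,r1", "bne r2,r4,0x400"])
print(buf.fetch(0x401))   # -> add r3,r3,r1
```

Every fetch inside the loop thus costs one small-RAM read and one subtraction, which is where the energy advantage over a tagged cache of the same capacity comes from.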
There are also proposals for fully-automatic loop caches that detect small loops at run time and install them in the loop cache dynamically [150, 110, 25, 232]. However, although such dynamic proposals enhance the generality of the loop cache at the expense of additional hardware, they are not critical for the DSP and embedded world, where loop buffers have been successfully deployed. This is because in a controlled software environment the most efficient solution is usually preferable for cost reasons.
In contrast, a fully-automatic loop buffer appears in Intel's Core 2 architecture [110]. Intel embeds the loop buffer in the Instruction Queue (IQ). A hardware loop-detection mechanism, called the Loop Stream Detector (LSD), detects small loops that already fit inside the 18-entry instruction queue. Once a loop is detected, instructions for subsequent loop iterations are streamed from the IQ without any external fetching, until a misprediction on the loop branch is detected. This not only speeds up instruction fetch but at the same time saves considerable energy by not performing external fetches for the streamed iterations.
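The detection step can be sketched as follows. This is an illustrative model, not Intel's actual LSD logic: a loop is assumed to be streamable when a taken backward branch targets an instruction that is still resident in the queue, and the queue entries are represented as simple dictionaries.

```python
# Illustrative sketch of loop-stream detection in an instruction queue.
# The queue model, entry format, and detection rule are assumptions made
# for exposition; they are not Intel's implementation.

IQ_DEPTH = 18  # depth of the instruction queue, as described in the text

def detect_loop(queue):
    """Return (start, end) indices of a streamable loop, or None.

    A loop qualifies when a taken backward branch in the queue targets
    an instruction that is still resident earlier in the same queue, so
    iterations can be replayed without any external fetching.
    """
    for i, insn in enumerate(queue):
        if insn.get("taken_backward_branch"):
            target = insn["target"]
            # The branch target must map to an instruction still in the queue.
            for j in range(i + 1):
                if queue[j]["addr"] == target:
                    return (j, i)   # stream queue[j..i] for later iterations
    return None

# A six-instruction window whose last entry branches back to the third.
queue = [{"addr": a} for a in range(0x100, 0x100 + 6)]
queue[5]["taken_backward_branch"] = True
queue[5]["target"] = 0x102
print(detect_loop(queue))   # -> (2, 5)
```

Once such a window is found, the front end can replay entries `j..i` directly from the queue, which is exactly what makes the energy saving possible: the instruction cache and the rest of the fetch path sit idle while the loop iterates.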