accessing the instruction cache (or trace cache) and by not decoding the same loop instructions over
and over again.
4.10.4 Trace Cache
The concept of storing a trace, that is, a group of consecutive instructions as they appear in dynamic
execution, and reusing it was first published by Rotenberg, Smith, and Bennett [193] as
a means to increase instruction fetch bandwidth. In this respect it is closely related to the
loop cache. However, the trace cache goes further. The idea is to embed branch prediction
in instruction fetching and to fetch long stretches of instructions despite abrupt changes in the
control flow. Although the idea works well for its original purpose, it found an even more
important role as a mechanism to reduce the energy consumed by most of the front end of
the Pentium-4 processor. This is due to the CISC nature of the IA-32 (x86) instruction set
executed by the Pentium-4 [110].
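To make the mechanism concrete, the following minimal sketch in C models a trace cache line tagged by its starting address and by the predicted outcomes of the branches it contains, so that one lookup can supply several basic blocks in a single fetch. The sizes, the indexing, and all names are illustrative assumptions, not the design of [193].

/* Hypothetical sketch of a trace-cache lookup: a line is tagged with the
 * starting PC and the predicted outcomes of up to MAX_BRANCHES branches
 * it contains, so a single fetch can deliver several basic blocks despite
 * changes in control flow. Sizes and names are illustrative only. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TRACE_LINES   512
#define TRACE_LEN      16   /* max instructions per trace        */
#define MAX_BRANCHES    3   /* max conditional branches per trace */

typedef struct {
    bool     valid;
    uint64_t start_pc;          /* tag: address of the first instruction      */
    uint8_t  branch_mask;       /* predicted taken/not-taken pattern          */
    uint32_t insts[TRACE_LEN];  /* the stored trace itself                    */
    int      len;
} trace_line_t;

static trace_line_t tcache[TRACE_LINES];

/* Returns the number of instructions delivered, or 0 on a miss
 * (in which case the conventional fetch path is used instead). */
int trace_fetch(uint64_t pc, uint8_t predicted_mask, uint32_t *out)
{
    trace_line_t *line = &tcache[(pc >> 2) % TRACE_LINES];
    if (line->valid && line->start_pc == pc &&
        line->branch_mask == predicted_mask) {
        memcpy(out, line->insts, line->len * sizeof(uint32_t));
        return line->len;       /* hit: whole trace in one access */
    }
    return 0;                   /* miss: fall back to normal fetch */
}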
The peculiarities of a complex instruction set with variable-length instructions, such
as the IA-32, make it extremely difficult to execute directly on a dynamically scheduled superscalar
core. Intel's solution is to translate the IA-32 instructions into RISC-like instructions called
uops. The uops follow the RISC philosophy of fixed-length instructions (112 bits long) and
of a load-store execution model. IA-32 instructions that can access memory are typically
translated into sequences of load-modify-store uops.
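As an illustration of this cracking step, the sketch below expands a memory-destination instruction of the form add [ebx], eax into a load, an ALU operation, and a store uop through an internal temporary register. The uop encoding and field names are invented for the example and are not Intel's internal format.

/* Illustrative sketch of CISC-to-uop cracking: a memory-destination
 * IA-32 instruction such as "add [ebx], eax" is expanded into a
 * load / modify / store sequence of fixed-format uops. The encoding
 * below is invented for the example, not Intel's internal format. */
#include <stdint.h>

typedef enum { UOP_LOAD, UOP_ALU_ADD, UOP_STORE } uop_kind_t;

typedef struct {
    uop_kind_t kind;
    uint8_t    dst;      /* destination register (or temp)   */
    uint8_t    src1;     /* source register / address base   */
    uint8_t    src2;
} uop_t;

/* Crack "add [base], src" into three uops using a temporary register. */
int crack_add_mem(uint8_t base_reg, uint8_t src_reg, uop_t out[3])
{
    const uint8_t TMP = 64;                               /* internal temp   */
    out[0] = (uop_t){ UOP_LOAD,    TMP, base_reg, 0   };  /* tmp <- [base]   */
    out[1] = (uop_t){ UOP_ALU_ADD, TMP, TMP, src_reg  };  /* tmp += src      */
    out[2] = (uop_t){ UOP_STORE,   0,   base_reg, TMP };  /* [base] <- tmp   */
    return 3;
}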
The work required in such a front end is tremendous, and this is reflected in the large
percentage (28%) of the total power devoted to the front end. Even before the translation
from IA-32 instructions to uops takes place, considerable work is required just to fetch the
variable-length (1 to 15 bytes) IA-32 instructions, detect multiple prefix bytes, align them, and so on. Decoding
multiple IA-32 instructions per cycle and emitting uops to the rename stage is one of the most
power-consuming operations in the Pentium-4 processor.
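The following grossly simplified sketch hints at why length decoding alone is costly: the decoder must first skip a variable number of prefix bytes and then inspect the instruction body before it even knows where the next instruction starts, which serializes the work of finding instruction boundaries. A real IA-32 decoder must also examine the opcode, ModRM, SIB, displacement, and immediate fields; the prefix table and the placeholder below are assumptions for illustration only.

/* Grossly simplified sketch of IA-32 length pre-decoding: skip an unknown
 * number of prefix bytes, then determine the body length before the start
 * of the next instruction is known. Not a real x86 decoder. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool is_prefix(uint8_t b)
{
    /* Legacy prefixes: operand/address size, segment overrides, LOCK, REP. */
    return b == 0x66 || b == 0x67 || b == 0xF0 || b == 0xF2 || b == 0xF3 ||
           b == 0x2E || b == 0x36 || b == 0x3E || b == 0x26 || b == 0x64 ||
           b == 0x65;
}

static size_t body_length(const uint8_t *p)
{
    /* Placeholder: a real decoder derives the length from the opcode,
     * ModRM, SIB, displacement, and immediate fields. */
    (void)p;
    return 1;
}

/* Each instruction's length depends on its own bytes, so finding the
 * start of instruction N requires decoding instructions 0..N-1 first. */
size_t next_instruction(const uint8_t *code, size_t offset)
{
    size_t i = offset;
    while (is_prefix(code[i]))          /* possibly several prefix bytes */
        i++;
    return i + body_length(&code[i]);   /* 1 to 15 bytes in total        */
}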
To address this problem, Solomon, Mendelson, Orenstien, Almog, and Ronen describe a
trace cache that can eliminate the repeated work of fetching, decoding, and translating the same
instructions over and over again [210]. Called the Micro-Operation Cache (µC), the concept
was implemented as the trace cache of the Pentium-4. The reason it works so well in this
environment is that traces are created after the IA-32 instructions have been decoded and translated
into uops. Traces are uop sequences and are issued directly as such.
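A minimal sketch of this idea, anticipating the fill path described in the next paragraph, is given below: decoded uops accumulate in a fill buffer until a branch (or a full line) closes the line, the line is installed tagged by the starting IA-32 address, and on a later hit the stored uops are issued directly, bypassing fetch, length decode, and translation. Structure sizes, the indexing, and field names are illustrative assumptions rather than the actual µC organization.

/* Minimal sketch of the Micro-Operation Cache idea: uop sequences produced
 * by the IA-32 decoder are captured in a fill buffer and installed as a
 * line tagged by the starting IA-32 address; a later hit issues the stored
 * uops directly, skipping fetch, length decode, and translation. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UC_SETS      64
#define UC_LINE_UOPS  6

typedef struct { uint32_t bits[4]; bool is_branch; } uop_t;  /* ~112-bit uop */

typedef struct {
    bool     valid;
    uint64_t start_ip;              /* IA-32 address of the first instruction */
    uop_t    uops[UC_LINE_UOPS];
    int      len;
} uc_line_t;

static uc_line_t uc[UC_SETS];
static uc_line_t fill;              /* fill buffer on the decode path */

static int uc_index(uint64_t ip) { return (int)(ip % UC_SETS); }

/* Decode path: accumulate uops; a branch (or a full line) closes the line. */
void uc_fill(uint64_t start_ip, const uop_t *u)
{
    if (fill.len == 0)
        fill.start_ip = start_ip;
    fill.uops[fill.len++] = *u;
    if (u->is_branch || fill.len == UC_LINE_UOPS) {
        fill.valid = true;
        uc[uc_index(fill.start_ip)] = fill;        /* install the line */
        memset(&fill, 0, sizeof fill);
    }
}

/* Front end: on a hit, issue stored uops and bypass fetch/decode/translate. */
int uc_lookup(uint64_t ip, uop_t *out)
{
    uc_line_t *line = &uc[uc_index(ip)];
    if (line->valid && line->start_ip == ip) {
        memcpy(out, line->uops, line->len * sizeof(uop_t));
        return line->len;
    }
    return 0;                                      /* miss: use decode path */
}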
Figure 4.34 shows the concept of the Micro-Operation Cache (adapted from [210]).
The µC fill path starts after the instruction decode. A fill buffer is filled with uops until the
first branch is encountered. In this respect, the µC is more of a basic block history buffer
(see BHB, [107]) than a trace cache, but this is not an inherent limitation of the design;
it was simply chosen to keep the design as efficient as possible. Another interesting characteristic of the
µC design is that although a hit can be determined in the µC during the first pipeline stage,