and further improves performance, bringing it very close (a 2-6% slowdown, depending on optimizations) to a full pipeline operating on uncompressed operands [44]. The byte-parallel pipeline brings us back to the first narrow-width technique, which gates unused high-order bits, albeit at a different granularity (the byte level) and without requiring that the significant bits be consecutive LSBs.
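To illustrate byte-granularity gating, here is a minimal Python sketch (a software model only, not the hardware mechanism described above; the function name and word width are illustrative assumptions) that flags which bytes of an operand are all zeros. In a byte-parallel pipeline, any such byte could be gated, even when the significant bytes do not form a contiguous run of low-order bytes:

```python
def zero_byte_mask(value: int, width_bytes: int = 4) -> list:
    """Return per-byte flags for a word: True where a byte is all zeros.

    A byte-parallel pipeline need not latch or operate on zero bytes.
    Note that any zero byte qualifies, not only a run of high-order
    bytes above some count of significant LSBs.
    """
    return [((value >> (8 * i)) & 0xFF) == 0 for i in range(width_bytes)]
```

For example, `zero_byte_mask(0x00FF0000)` returns `[True, True, False, True]`: the single significant byte sits in the middle of the word, a pattern that a scheme restricted to consecutive LSBs could not exploit.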
4.3.3 Further Reading on Narrow Width Operands
The idea of narrow-width values has been applied to other structures as well. Ergin, Balkan, Ghose, and Ponomarev apply it to register files [72]. The intent is not so much to reduce power consumption as to alleviate register pressure by making better use of the available physical registers. Just as two narrow values can be packed into the inputs of functional units, or compressed lines into caches, multiple narrow values are packed into registers.
A number of these values can be packed in a register either “conservatively” or “speculatively.” Conservative packing means that a value is packed only after it has been classified as narrow, which happens once the value is produced by a functional unit. When a narrow value is packed into a different register than the one it was originally destined for, the register mapping for the packed value is updated in all in-flight instructions. In contrast, “speculative” packing takes place in the register renaming stage, without certain knowledge of the width of the packed value. Packing and physical register assignment are performed by predicting the output width of instructions, with the per-instruction prediction history kept in the instruction cache. The technique works well for performance, increasing IPC by 15% on SPEC2000, but may not offer significant power advantages due to its complexity.
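The conservative path described above can be modeled with a short Python sketch. The 64-bit register width, the 32-bit narrowness threshold, and all function names are illustrative assumptions for this sketch, not details taken from [72]:

```python
REG_BITS = 64          # assumed physical register width
HALF = REG_BITS // 2   # each packed value gets half the register

def is_narrow(value: int) -> bool:
    """Conservative classification: a value is narrow if it fits in
    the low half of a physical register. In hardware this check
    happens after the functional unit produces the value."""
    return 0 <= value < (1 << HALF)

def pack(a: int, b: int) -> int:
    """Pack two already-classified narrow values into one register."""
    assert is_narrow(a) and is_narrow(b)
    return (b << HALF) | a

def unpack(reg: int, slot: int) -> int:
    """Read back the value in slot 0 (low half) or slot 1 (high half).
    The slot index models the updated register mapping that in-flight
    instructions would use to locate a repacked value."""
    return (reg >> (slot * HALF)) & ((1 << HALF) - 1)
```

Speculative packing would instead call `pack` at rename time based on a width prediction, and would need a recovery path for the case where the produced value turns out not to satisfy `is_narrow`.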
A different approach is followed in the work of Rochecouste, Pokam, and Seznec [192], who propose a processor with dedicated narrow-width datapaths: a width-partitioned microarchitecture (WPM). This is a work-steering technique for this type of excess activity and is detailed in Section 4.13.
Finally, a scheme that packs multiple compressed instructions to improve instruction fetch bandwidth and power has been proposed by Hines, Green, Tyson, and Whalley [100]. Because this scheme uses Frequent Value Compression, which is explained next, we leave the details for the end of Section 4.4.
4.4 IDLE-WIDTH SWITCHING ACTIVITY: CACHES
Techniques addressing idle-width activity can also be extended to cache operations (reading and writing the cache). For example, power can be saved by accessing only the significant or the compressed part of a word. This results in reading or writing fewer bits and corresponds to clock gating unused parts of the ALU or the datapath. Alternatively, multiple cache lines can be compressed and packed in the space of an uncompressed line. This improves the performance