and further improves performance, bringing it very close (a 2-6% slowdown, depending on optimizations) to a full pipeline operating on uncompressed operands [44]. The byte-parallel pipeline brings us back to the first narrow-width technique, which gates unused high-order bits, albeit at a different granularity (the byte level) and without requiring that the significant bits be consecutive LSBs.
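To illustrate byte-granularity gating, here is a minimal Python sketch (a software model only, not the hardware mechanism described above; the function name and word width are illustrative assumptions) that flags which bytes of an operand are all zeros. In a byte-parallel pipeline, any such byte could be gated, even when the significant bytes do not form a contiguous run of low-order bytes:

```python
def zero_byte_mask(value: int, width_bytes: int = 4) -> list:
    """Return per-byte flags for a word: True where a byte is all zeros.

    A byte-parallel pipeline need not latch or operate on zero bytes.
    Note that any zero byte qualifies, not only a run of high-order
    bytes above some count of significant LSBs.
    """
    return [((value >> (8 * i)) & 0xFF) == 0 for i in range(width_bytes)]
```

For example, `zero_byte_mask(0x00FF0000)` returns `[True, True, False, True]`: the single significant byte sits in the middle of the word, a pattern that a scheme restricted to consecutive LSBs could not exploit.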
4.3.3 Further Reading on Narrow Width Operands
The idea of narrow-width values has been applied to other structures as well. Ergin, Balkan, Ghose, and Ponomarev apply it to register files [72]. The intent is not so much to reduce power consumption as to alleviate register pressure by making better use of the available physical registers. Just as two narrow values can be packed into the inputs of functional units, or compressed lines into caches, multiple narrow values are packed into registers.
A number of these values can be packed in a register either “conservatively” or “speculatively.” Conservative packing means that a value is packed only after it has been classified as narrow, which happens once the value is produced by a functional unit. When a narrow value is packed into a different register than the one it was originally destined for, the register mapping for the packed value is updated in all in-flight instructions. In contrast, “speculative” packing takes place in the register renaming stage, without certain knowledge of the width of the packed value. Packing and physical register assignment are performed by predicting the output width of instructions, with the per-instruction prediction history kept in the instruction cache. The technique works well for performance, increasing IPC by 15% on SPEC2000, but may not offer significant power advantages due to its complexity.
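The conservative path described above can be modeled with a short Python sketch. The 64-bit register width, the 32-bit narrowness threshold, and all function names are illustrative assumptions for this sketch, not details taken from [72]:

```python
REG_BITS = 64          # assumed physical register width
HALF = REG_BITS // 2   # each packed value gets half the register

def is_narrow(value: int) -> bool:
    """Conservative classification: a value is narrow if it fits in
    the low half of a physical register. In hardware this check
    happens after the functional unit produces the value."""
    return 0 <= value < (1 << HALF)

def pack(a: int, b: int) -> int:
    """Pack two already-classified narrow values into one register."""
    assert is_narrow(a) and is_narrow(b)
    return (b << HALF) | a

def unpack(reg: int, slot: int) -> int:
    """Read back the value in slot 0 (low half) or slot 1 (high half).
    The slot index models the updated register mapping that in-flight
    instructions would use to locate a repacked value."""
    return (reg >> (slot * HALF)) & ((1 << HALF) - 1)
```

Speculative packing would instead call `pack` at rename time based on a width prediction, and would need a recovery path for the case where the produced value turns out not to satisfy `is_narrow`.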
A different approach is followed in the work of Rochecouste, Pokam, and Seznec [192], who propose a processor with dedicated narrow-width datapaths: a width-partitioned microarchitecture (WPM). This is a work-steering technique for this type of excess activity and is detailed in Section 4.13.
Finally, a scheme that packs multiple compressed instructions to improve instruction fetch bandwidth and power has been proposed by Hines, Green, Tyson, and Whalley [100]. Because this scheme uses Frequent Value Compression, which is explained next, we leave the details for the end of Section 4.4.
4.4 IDLE-WIDTH SWITCHING ACTIVITY: CACHES
Techniques addressing idle-width activity can also be extended to cache operations (reading and writing the cache). For example, power can be saved by accessing only the significant or the compressed part of a word. This results in reading or writing fewer bits and corresponds to clock gating unused parts of the ALU or the datapath. Alternatively, multiple cache lines can be compressed and packed in the space of an uncompressed line. This improves the performance