Using Voltage and Frequency Adjustments to Manage Dynamic Power - Computer Architecture Techniques for Power-Efficiency

micro-operation,” and used this as an indicator of the memory boundedness indicative of likely

DVFS effectiveness. For each pattern of past behavior stored in a history entry, a different

prediction of next-step behavior can be made. For each next-step prediction, there is a one-

to-one mapping to an appropriate DVFS setting. If the DVFS setting is different from the

current setting, then V and f are adjusted accordingly. When guided by the GPHT, DVFS

was found to achieve EDP improvements as high as 34% for the highly variable benchmarks

that this approach targets.
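
The predict-then-map loop described above can be sketched as follows. This is a minimal illustration only, not the published GPHT design: the phase thresholds, the voltage/frequency table, the history depth, the LRU replacement, and the last-value fallback on a table miss are all assumptions chosen for the example, and every name in the code is hypothetical.

```python
from collections import OrderedDict, deque

# Hypothetical thresholds on "memory accesses per micro-operation";
# a higher value means a more memory-bound phase.
PHASE_THRESHOLDS = (0.005, 0.010, 0.020)

# Hypothetical one-to-one mapping from predicted phase to (volts, MHz).
DVFS_SETTINGS = {0: (1.40, 1500), 1: (1.30, 1200), 2: (1.15, 1000), 3: (1.00, 800)}


def classify_phase(mem_per_uop):
    """Map a measured memory-accesses-per-micro-op value to a phase id."""
    for phase, limit in enumerate(PHASE_THRESHOLDS):
        if mem_per_uop < limit:
            return phase
    return len(PHASE_THRESHOLDS)


class GlobalPhaseHistoryPredictor:
    """Simplified history-pattern predictor (assumed structure, not the original)."""

    def __init__(self, history_depth=4, table_entries=64):
        self.history = deque(maxlen=history_depth)  # most recent phases
        self.table = OrderedDict()                  # pattern -> phase that followed it
        self.table_entries = table_entries

    def observe(self, phase):
        """Record the phase just measured and return a prediction for the next one."""
        # Train: the pattern held before this interval was followed by `phase`.
        if len(self.history) == self.history.maxlen:
            key = tuple(self.history)
            self.table[key] = phase
            self.table.move_to_end(key)
            if len(self.table) > self.table_entries:
                self.table.popitem(last=False)      # evict least recently used
        self.history.append(phase)

        # Predict: look up the updated pattern; fall back to last value on a miss.
        key = tuple(self.history)
        if len(self.history) == self.history.maxlen and key in self.table:
            self.table.move_to_end(key)
            return self.table[key]
        return phase


def run_controller(samples, predictor, current=DVFS_SETTINGS[0]):
    """Return the (V, f) setting chosen after each sampling interval."""
    trace = []
    for mem_per_uop in samples:
        predicted = predictor.observe(classify_phase(mem_per_uop))
        target = DVFS_SETTINGS[predicted]
        if target != current:   # adjust V and f only when the setting changes
            current = target
        trace.append(current)
    return trace
```

On a repeating phase sequence, a predictor of this kind learns which phase follows each history pattern after a short warm-up, whereas a plain last-value predictor would mispredict at every phase change. That is the property that makes history-based prediction attractive for highly variable workloads.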

3.4 PROGRAM-LEVEL DVFS FOR MULTIPLE-CLOCK DOMAINS

Some of the early architectural work on DVFS actually focused on opportunities within

multiple-clock-domain (MCD) processors. The rationale for MCD processors is that as feature

sizes get smaller, it becomes more difficult and expensive to distribute a global clock signal with

low skew through the processor die. Thus, researchers have explored globally-asynchronous

locally-synchronous (GALS) techniques.

Scaling voltage/frequency independently for each clock domain within a processor can

be done dynamically (Section 3.4.1) or statically (Section 3.4.2); both cases aim to exploit slack

in the execution of individual instructions.

Finally, the emerging architectural paradigm for deep sub-micron technologies, the

multi-core paradigm, can be considered as an MCD design where synchronous cores operate

asynchronously to each other. DVFS techniques for multi-cores are discussed in

Section 3.4.3.
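
To make the appeal of independent per-domain scaling concrete, the sketch below uses the standard first-order model in which dynamic power is proportional to switched capacitance times V squared times f. The domain names, capacitance values, operating points, and the simplification that voltage scales linearly with frequency are all assumptions made for illustration; none of them come from the work discussed in this section.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClockDomain:
    name: str
    switched_capacitance: float  # activity-weighted capacitance, arbitrary units
    voltage: float               # volts
    frequency_hz: float

    def dynamic_power(self) -> float:
        # First-order model: P_dyn is proportional to C * V^2 * f.
        return self.switched_capacitance * self.voltage ** 2 * self.frequency_hz


def scale_domain(domain: ClockDomain, factor: float) -> ClockDomain:
    """Scale one domain's V and f by the same factor (a simplifying assumption)."""
    return replace(
        domain,
        voltage=domain.voltage * factor,
        frequency_hz=domain.frequency_hz * factor,
    )


def total_dynamic_power(domains) -> float:
    return sum(d.dynamic_power() for d in domains)


# Hypothetical two-domain example: slow down only the floating-point domain.
fetch = ClockDomain("fetch", 1.0e-9, 1.2, 1.0e9)
fp = ClockDomain("fp", 1.5e-9, 1.2, 1.0e9)
baseline = total_dynamic_power([fetch, fp])
scaled = total_dynamic_power([fetch, scale_domain(fp, 0.8)])
savings_fraction = 1.0 - scaled / baseline
```

Under this model, scaling a single domain's voltage and frequency by a factor k reduces that domain's dynamic power by k cubed while leaving the other domains untouched. This is why slack confined to one domain can be turned into power savings without slowing the rest of the core.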

3.4.1 DVFS for MCD Processors

In GALS approaches, a processor core is divided into synchronous islands, each of which is

then interconnected asynchronously but with added circuitry to avoid metastability. The islands

are typically intended to correspond to different functional units, such as the instruction fetch

unit, the ALUs, the load-store unit, and so forth. A typical division is shown in Figure 3.5.

In early GALS DVFS work, Marculescu and her students considered the performance

and power implications of GALS designs [216, 117]. In [117], they first predicted that going

from a synchronous to a GALS design would cause a drop in performance, but that elimination of the

global clock would not single-handedly lead to drastic power reductions. In fact, from a power

perspective, GALS designs are initially less efficient when compared to synchronous architectures.

Their potential, however, lies in the flexibility offered by having several independently

controllable clocks. As with other DVFS opportunities, the key lies in finding inter-domain

slack that one can exploit. For example, in some MCD designs, the floating point unit could be

clocked much more slowly than the instruction fetch unit, because its throughput and latency

demands are lower. Iyer and Marculescu's results show that for a GALS processor with five
