micro-operation,” and used this as an indicator of memory boundedness, which in turn signals likely
DVFS effectiveness. For each pattern of past behavior stored in a history entry, a different
prediction of next-step behavior can be made. Each next-step prediction maps one-to-one
to an appropriate DVFS setting. If that setting differs from the
current one, the voltage and frequency (V, f) are adjusted accordingly. When guided by the GPHT, DVFS
was found to achieve EDP improvements as high as 34% for the highly variable benchmarks
that this approach targets.
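The history-based flow described above (pattern of past behavior, next-step prediction, one-to-one mapping to a DVFS setting) can be illustrated with a minimal sketch. It is not the published GPHT implementation. The phase thresholds, the history depth of 4, the (V, f) values, the unbounded dictionary used as the history table, and the last-value fallback on a table miss are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical phase boundaries on a memory-boundedness metric (assumed values).
PHASE_THRESHOLDS = [0.005, 0.010, 0.020]

# Hypothetical one-to-one map: predicted phase -> (voltage in V, frequency in MHz).
DVFS_SETTINGS = {0: (1.4, 1500), 1: (1.3, 1300), 2: (1.1, 1000), 3: (0.9, 600)}


def classify(metric):
    """Quantize a metric sample into a phase ID (0 = CPU-bound, 3 = memory-bound)."""
    return sum(metric >= t for t in PHASE_THRESHOLDS)


class HistoryPredictor:
    """Predicts the next phase from the pattern of the last `depth` phases."""

    def __init__(self, depth=4):
        self.history = deque(maxlen=depth)  # most recent phase IDs
        self.table = {}                     # history pattern -> phase that followed it

    def predict(self):
        if not self.history:
            return 0  # nothing observed yet: assume CPU-bound
        # On a table miss, fall back to last-value prediction (an assumption).
        return self.table.get(tuple(self.history), self.history[-1])

    def update(self, phase):
        # Record which phase followed the current pattern once the history is full.
        if len(self.history) == self.history.maxlen:
            self.table[tuple(self.history)] = phase
        self.history.append(phase)


def run(metrics, depth=4):
    """Return the per-interval predictions and the (V, f) setting applied."""
    predictor = HistoryPredictor(depth)
    current = DVFS_SETTINGS[0]
    predictions, applied = [], []
    for metric in metrics:
        predicted = predictor.predict()
        target = DVFS_SETTINGS[predicted]
        if target != current:  # adjust V, f only when the setting changes
            current = target
        predictions.append(predicted)
        applied.append(current)
        predictor.update(classify(metric))
    return predictions, applied
```

On a strictly periodic input, such as two CPU-bound intervals followed by two memory-bound ones, this sketch mispredicts during warm-up and then tracks the pattern once every history pattern has been seen.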
3.4 PROGRAM-LEVEL DVFS FOR MULTIPLE-CLOCK DOMAINS
Some of the early architectural work on DVFS actually focused on opportunities within
multiple-clock-domain (MCD) processors. The rationale for MCD processors is that as feature
sizes get smaller, it becomes more difficult and expensive to distribute a global clock signal with
low skew through the processor die. Thus, researchers have explored globally-asynchronous
locally-synchronous (GALS) techniques.
Scaling voltage/frequency independently for each clock domain within a processor can
be done dynamically (Section 3.4.1) or statically (Section 3.4.2); both cases aim to exploit slack
in the execution of individual instructions.
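As a rough illustration of why per-domain scaling is attractive, the sketch below picks, for each clock domain, the slowest operating point that still meets that domain's demand. It then estimates dynamic power relative to running at the fastest point, using the standard relation that dynamic power scales with V^2 * f. The operating points, domain names, and demand figures are hypothetical. Switched capacitance is assumed unchanged, and static power and synchronization overheads between domains are ignored.

```python
# Hypothetical (voltage in V, frequency in GHz) operating points, slowest first.
OPERATING_POINTS = [(0.8, 0.5), (1.0, 1.0), (1.2, 1.5)]


def pick_point(required_ghz):
    """Slowest operating point whose frequency still meets the domain's demand."""
    for v, f in OPERATING_POINTS:
        if f >= required_ghz:
            return v, f
    return OPERATING_POINTS[-1]  # demand exceeds every point: run flat out


def relative_dynamic_power(v, f, v_max, f_max):
    """Dynamic power scales as V^2 * f; returns the ratio to the fastest point."""
    return (v * v * f) / (v_max * v_max * f_max)


def plan(demands):
    """Map each domain name to its chosen (V, f) and relative dynamic power."""
    v_max, f_max = OPERATING_POINTS[-1]
    result = {}
    for name, required_ghz in demands.items():
        v, f = pick_point(required_ghz)
        result[name] = (v, f, relative_dynamic_power(v, f, v_max, f_max))
    return result


if __name__ == "__main__":
    # Hypothetical per-domain frequency demands in GHz.
    demands = {"fetch": 1.5, "integer": 1.4, "floating_point": 0.4, "load_store": 0.9}
    for name, (v, f, rel) in plan(demands).items():
        print(f"{name}: V={v}, f={f} GHz, relative dynamic power={rel:.2f}")
```

Under these assumed numbers, a lightly loaded domain dropped to the slowest point draws a small fraction of its full-speed dynamic power, while the busy domains are left untouched.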
Finally, the emerging architectural paradigm for deep sub-micron technologies, the
multi-core paradigm, can be considered an MCD design in which synchronous cores
operate asynchronously with respect to one another. DVFS techniques for multi-cores are discussed in
Section 3.4.3.
3.4.1 DVFS for MCD Processors
In GALS approaches, a processor core is divided into synchronous islands, which are
then interconnected asynchronously, with added circuitry to avoid metastability. The islands
are typically intended to correspond to different functional units, such as the instruction fetch
unit, the ALUs, the load-store unit, and so forth. A typical division is shown in Figure 3.5.
In early GALS DVFS work, Marculescu and her students considered the performance
and power implications of GALS designs [216, 117]. In [117], they first predicted that going
from a synchronous to a GALS design would cause a drop in performance, and that elimination of the
global clock would not single-handedly lead to drastic power reductions. In fact, from a power
perspective, GALS designs are initially less efficient than synchronous architectures.
Their potential, however, lies in the flexibility offered by having several independently
controllable clocks. As with other DVFS opportunities, the key lies in finding inter-domain
slack that one can exploit. For example, in some MCD designs, the floating point unit could be
clocked much more slowly than the instruction fetch unit, because its throughput and latency
demands are lower. Iyer and Marculescu's results show that for a GALS processor with five