 1. lw  $10,a
 2. lw  $11,b
 3. lw  $12,d
 4. lw  $13,c
 5. mul $11,$10,$11
 6. add $10,$13,$12
 7. mul $11,$11,$10
 8. mul $12,$10,$12
 9. add $12,$11,$12
10. sw  $12,a

Figure 13.25: Delay-free MIPS code for a=((a*b)*(c+d))+(d*(c+d))
instruction 4 uses a previous value of $10, loaded in instruction 1. If we add an
additional register, $13, to our allocation, we can load it in instruction 5 (taking
care to reference $13 rather than $10 in instruction 6). Now instruction 5 can
be moved earlier in the sequence, avoiding a stall. The resulting delay-free
code is shown in Figure 13.25.
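As a rough check, the schedule of Figure 13.25 can be verified mechanically. The following Python sketch (an illustration, not part of the text's algorithm) assumes a one-cycle load delay slot, as on the early MIPS pipeline, and counts the stalls that arise when an instruction uses the value loaded by the immediately preceding lw:

```python
# Sketch: count load-use stalls, assuming a one-cycle load delay slot.
def count_stalls(code):
    """code: list of (op, dest, sources) tuples in program order."""
    stalls = 0
    for i in range(1, len(code)):
        prev_op, prev_dest, _ = code[i - 1]
        _, _, srcs = code[i]
        # A stall occurs when an instruction uses the value loaded
        # by the lw that immediately precedes it.
        if prev_op == "lw" and prev_dest in srcs:
            stalls += 1
    return stalls

# The sequence of Figure 13.25: no lw result is used by the
# instruction that immediately follows the load.
code = [
    ("lw",  "$10", []),              # 1. lw  $10,a
    ("lw",  "$11", []),              # 2. lw  $11,b
    ("lw",  "$12", []),              # 3. lw  $12,d
    ("lw",  "$13", []),              # 4. lw  $13,c
    ("mul", "$11", ["$10", "$11"]),  # 5. mul $11,$10,$11
    ("add", "$10", ["$13", "$12"]),  # 6. add $10,$13,$12
    ("mul", "$11", ["$11", "$10"]),  # 7. mul $11,$11,$10
    ("mul", "$12", ["$10", "$12"]),  # 8. mul $12,$10,$12
    ("add", "$12", ["$11", "$12"]),  # 9. add $12,$11,$12
    ("sw",  "$12", ["$12"]),         # 10. sw $12,a
]
print(count_stalls(code))  # 0: the schedule is delay-free
```

Reordering instruction 5 ahead of any lw whose result it consumes would make the count nonzero, which is exactly what the scheduler is trying to prevent.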
It is evident that there is a tension between code scheduling, which tries to increase the number of registers used (to avoid stalls), and code generation, which seeks to reduce the number of registers used (to avoid spills and make registers available for other purposes). An alternative to postpass code scheduling is an integrated approach that intermixes register allocation and code scheduling.
The Goodman-Hsu [GH88] algorithm is a well-known integrated register allocator and code scheduler. As long as registers are available, it uses them to improve code scheduling by loading needed values into distinct registers. This allows loads to “float” to the beginning of the code sequence, eliminating stalls in later instructions that use the loaded values. When registers grow scarce, the algorithm switches emphasis and begins to schedule code to free registers. When sufficient registers are available, it resumes scheduling to avoid stalls. Experience has shown that this approach nicely balances the need to use registers sparingly and yet avoid stalls whenever possible.
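The switch of emphasis can be sketched as a list scheduler whose priority function depends on register pressure. This is a simplified illustration of the idea, not the published Goodman-Hsu algorithm; the candidate attributes and the pressure threshold are invented for the example:

```python
# Sketch of an integrated scheduler: choose the next instruction from the
# ready set, switching priority based on how many registers remain free.
def schedule(ready, num_free_regs, threshold=2):
    """Pick the next instruction to issue.

    Each candidate is a dict with (hypothetical) attributes:
      'latency' - cycles until its result is available
      'frees'   - registers whose last use this instruction is
      'needs'   - new registers its result would occupy
    """
    if num_free_regs > threshold:
        # Registers are plentiful: issue long-latency operations
        # (such as loads) as early as possible, to hide stalls.
        return max(ready, key=lambda ins: ins["latency"])
    # Registers are scarce: prefer instructions that free registers.
    return max(ready, key=lambda ins: ins["frees"] - ins["needs"])

ready = [
    {"name": "lw",  "latency": 2, "frees": 0, "needs": 1},
    {"name": "add", "latency": 1, "frees": 1, "needs": 0},
]
print(schedule(ready, num_free_regs=4)["name"])  # lw: plenty of registers
print(schedule(ready, num_free_regs=1)["name"])  # add: frees a register
```

With registers to spare, the load floats upward; under pressure, the add is issued first because it releases a register.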
13.4.2 Global and Dynamic Code Scheduling
Although we have focused on code scheduling at the basic block level, researchers have also studied global code scheduling [BR91]. Instructions may be moved upward, past the beginning of a basic block, to predecessor blocks in the control flow graph. We may need to move instructions out of a basic block because basic blocks are often very small, sometimes only an instruction or two in size. Moreover, certain instructions, like loads and floating point multiplies and divides, can incur long latencies. For example, a load that misses in the primary cache may stall for 10 or more cycles; a miss in the secondary cache (to main memory) can cost 100 or more cycles.
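The upward motion itself is mechanically simple; what makes global scheduling hard is proving the move safe. The toy sketch below (block names and instructions are invented) hoists the first instruction of a block into every predecessor, duplicating it when the block has more than one. A real global scheduler must first verify that no dependence is violated and that the duplicated instruction cannot fault or clobber a live value on any path:

```python
# Toy sketch of upward code motion in a CFG: hoist the first instruction
# of `block` into each of its predecessors. Safety checks are omitted;
# a real compiler must verify dependences and side effects first.
def hoist_first(cfg, preds, block):
    """cfg: block name -> instruction list; preds: block name -> predecessor names."""
    ins = cfg[block].pop(0)
    for p in preds[block]:
        cfg[p].append(ins)  # duplicated into each predecessor

cfg = {
    "B1": ["add $12,$1,$2"],
    "B2": ["sub $12,$1,$2"],
    "B3": ["lw $10,x", "mul $11,$10,$11"],
}
preds = {"B3": ["B1", "B2"]}
hoist_first(cfg, preds, "B3")
print(cfg["B1"])  # ['add $12,$1,$2', 'lw $10,x']
print(cfg["B3"])  # ['mul $11,$10,$11']
```

After the move, the load issues one block earlier on both paths into B3, so its latency can overlap the predecessor's remaining work instead of stalling the multiply.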