Hardware Reference
In-Depth Information
then local scheduling techniques, which operate on a single basic block, can be used. If finding
and exploiting the parallelism require scheduling code across branches, a substantially more
complex global scheduling algorithm must be used. Global scheduling algorithms are not only
more complex in structure, but they also must deal with significantly more complicated
trade-offs in optimization, since moving code across branches is expensive.
In Appendix H, we will discuss trace scheduling, one of these global scheduling techniques
developed specifically for VLIWs; we will also explore special hardware support that allows
some conditional branches to be eliminated, extending the usefulness of local scheduling and
enhancing the performance of global scheduling.
For now, we will rely on loop unrolling to generate long, straight-line code sequences, so
that we can use local scheduling to build up VLIW instructions and focus on how well these
processors operate.
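To make the idea of unrolling concrete, here is a minimal source-level sketch of the loop x[i] = x[i] + s with the body copied seven times. It is written in Python purely for illustration; the function names add_scalar and add_scalar_unrolled7 are hypothetical, and a real VLIW compiler would perform this transformation on machine instructions, not on source code. The cleanup loop for leftover elements is also our addition.

```python
def add_scalar(x, s):
    """Reference loop: x[i] = x[i] + s, one element per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + s
    return x


def add_scalar_unrolled7(x, s):
    """Same computation with the loop body copied seven times (illustrative)."""
    n = len(x)
    i = 0
    # Main loop: seven independent copies of the body share a single
    # index update and a single loop test, giving one long
    # straight-line sequence per iteration.
    while i + 7 <= n:
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        x[i + 4] = x[i + 4] + s
        x[i + 5] = x[i + 5] + s
        x[i + 6] = x[i + 6] + s
        i += 7
    # Cleanup loop (an assumption of this sketch): handles the
    # len(x) % 7 elements left over when the length is not a multiple of 7.
    while i < n:
        x[i] = x[i] + s
        i += 1
    return x
```

Because the seven copies touch different elements, they do not depend on one another, which is what gives a local scheduler the freedom to pack them into wide instructions.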
Example
Suppose we have a VLIW that could issue two memory references, two FP operations,
and one integer operation or branch in every clock cycle. Show an unrolled
version of the loop x[i] = x[i] + s (see page 158 for the MIPS code)
for such a processor. Unroll as many times as necessary to eliminate any stalls.
Ignore delayed branches.
Answer
Figure 3.16 shows the code. The loop has been unrolled to make seven copies
of the body, which eliminates all stalls (i.e., completely empty issue cycles), and
runs in 9 cycles. This code yields a running rate of seven results in 9 cycles, or
1.29 cycles per result, nearly twice as fast as the two-issue superscalar of Section
3.2 that used unrolled and scheduled code.
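The arithmetic behind this rate can be checked with a short sketch. The cycle count, unroll factor, and issue width come from the example above; the operation count is an assumption of ours (one load, one FP add, and one store per copy of the body, plus one index update and one branch), and the helper names are hypothetical.

```python
def cycles_per_result(cycles, results):
    """Average number of clock cycles spent per loop result."""
    return cycles / results


def slot_utilization(operations, cycles, slots_per_cycle):
    """Fraction of available issue slots that hold an operation."""
    return operations / (cycles * slots_per_cycle)


UNROLL = 7               # copies of the loop body (from the example)
CYCLES = 9               # cycles for the unrolled loop (from the example)
SLOTS_PER_CYCLE = 2 + 2 + 1  # two memory, two FP, one integer/branch slot

# Assumption: each copy needs a load, an FP add, and a store; the whole
# unrolled loop needs one index update and one branch.
OPERATIONS = 3 * UNROLL + 2

rate = cycles_per_result(CYCLES, UNROLL)
used = slot_utilization(OPERATIONS, CYCLES, SLOTS_PER_CYCLE)

print(f"{rate:.2f} cycles per result")
print(f"{OPERATIONS} operations in {CYCLES * SLOTS_PER_CYCLE} issue slots ({used:.0%})")
```

Under these assumptions the sketch reproduces the 1.29 cycles per result quoted above, and it also shows that a sizable share of the issue slots stays empty even though no cycle is completely idle.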