Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach

then local scheduling techniques, which operate on a single basic block, can be used. If finding and exploiting the parallelism require scheduling code across branches, a substantially more complex global scheduling algorithm must be used. Global scheduling algorithms are not only more complex in structure, but they also must deal with significantly more complicated trade-offs in optimization, since moving code across branches is expensive.

In Appendix H, we will discuss trace scheduling, one of these global scheduling techniques developed specifically for VLIWs; we will also explore special hardware support that allows some conditional branches to be eliminated, extending the usefulness of local scheduling and enhancing the performance of global scheduling.

For now, we will rely on loop unrolling to generate long, straight-line code sequences, so that we can use local scheduling to build up VLIW instructions and focus on how well these processors operate.
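As a rough illustration of what loop unrolling does at the source level, the sketch below replicates the body of a simple loop that adds a scalar to every array element. The function names and the unroll factor of 4 are assumptions chosen for illustration, not taken from the text. Python is used only to show the shape of the transformation; it says nothing about performance on a real VLIW.

```python
def add_scalar_rolled(x, s):
    """Reference loop: one element and one loop test per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + s
    return x


def add_scalar_unrolled4(x, s):
    """Same computation with the body replicated four times (hypothetical factor).

    The four statements in the main loop are independent of one another,
    which is what gives a scheduler a longer straight-line sequence to
    reorder and pack.
    """
    n = len(x)
    main = n - n % 4  # largest multiple of 4 that does not exceed n
    for i in range(0, main, 4):
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
    # Cleanup loop for the leftover elements when n is not a multiple of 4.
    for i in range(main, n):
        x[i] = x[i] + s
    return x
```

The unrolled version executes one loop test per four elements instead of one per element, and both functions return the same result for any input length.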

Example

Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[i] = x[i] + s (see page 158 for the MIPS code) for such a processor. Unroll as many times as necessary to eliminate any stalls. Ignore delayed branches.

Answer

Figure 3.16 shows the code. The loop has been unrolled to make seven copies of the body, which eliminates all stalls (i.e., completely empty issue cycles), and runs in 9 cycles. This code yields a running rate of seven results in 9 cycles, or 1.29 cycles per result, nearly twice as fast as the two-issue superscalar of Section 3.2 that used unrolled and scheduled code.
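The arithmetic behind the quoted rate can be checked with a short sketch. The cycle count, result count, and issue-slot mix come from the example above; the names ISSUE_SLOTS and cycles_per_result are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction

# Issue slots per clock cycle, as stated in the example.
ISSUE_SLOTS = {"memory": 2, "fp": 2, "int_or_branch": 1}


def cycles_per_result(cycles, results):
    """Exact ratio of clock cycles to results produced."""
    return Fraction(cycles, results)


rate = cycles_per_result(cycles=9, results=7)
slots_per_cycle = sum(ISSUE_SLOTS.values())

print(f"{float(rate):.2f} cycles per result")
print(f"{slots_per_cycle} issue slots per cycle, {slots_per_cycle * 9} over 9 cycles")
```

Nine cycles for seven results is exactly 9/7, which rounds to the 1.29 cycles per result given in the answer. The machine offers five issue slots per cycle, so 45 over the whole schedule.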
