FIGURE 3.16 VLIW instructions that occupy the inner loop and replace the unrolled sequence. This code takes 9 cycles assuming no branch delay; normally the branch delay would also need to be scheduled. The issue rate is 23 operations in 9 clock cycles, or 2.5 operations per cycle. The efficiency, the percentage of available slots that contained an operation, is about 60%. To achieve this issue rate requires a larger number of registers than MIPS would normally use in this loop. The VLIW code sequence above requires at least eight FP registers, while the same code sequence for the base MIPS processor can use as few as two FP registers or as many as five when unrolled and scheduled.
For the original VLIW model, there were both technical and logistical problems that made the approach less efficient. The technical problems are the increase in code size and the limitations of lockstep operation. Two different elements combine to increase code size substantially for a VLIW. First, generating enough operations in a straight-line code fragment requires ambitiously unrolling loops (as in earlier examples), thereby increasing code size. Second, whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding. In Appendix H, we examine software scheduling approaches, such as software pipelining, that can achieve the benefits of unrolling without as much code expansion.
To combat this code size increase, clever encodings are sometimes used. For example, there may be only one large immediate field for use by any functional unit. Another technique is to compress the instructions in main memory and expand them when they are read into the cache or are decoded. In Appendix H, we show other techniques, as well as document the significant code expansion seen on IA-64.
Early VLIWs operated in lockstep; there was no hazard-detection hardware at all. This struc-
ture dictated that a stall in any functional unit pipeline must cause the entire processor to stall,
since all the functional units must be kept synchronized. Although a compiler may be able to
schedule the deterministic functional units to prevent stalls, predicting which data accesses
will encounter a cache stall and scheduling them are very difficult. Hence, caches needed to be
 