FIGURE 3.16 VLIW instructions that occupy the inner loop and replace the unrolled sequence. This code takes 9 cycles assuming no branch delay; normally the branch delay would also need to be scheduled. The issue rate is 23 operations in 9 clock cycles, or 2.5 operations per cycle. The efficiency, the percentage of available slots that contained an operation, is about 60%. To achieve this issue rate requires a larger number of registers than MIPS would normally use in this loop. The VLIW code sequence above requires at least eight FP registers, while the same code sequence for the base MIPS processor can use as few as two FP registers or as many as five when unrolled and scheduled.
For the original VLIW model, there were both technical and logistical problems that made the approach less efficient. The technical problems are the increase in code size and the limitations of lockstep operation. Two different elements combine to increase code size substantially for a VLIW. First, generating enough operations in a straight-line code fragment requires ambitiously unrolling loops (as in earlier examples), thereby increasing code size. Second, whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding. In Appendix H, we examine software scheduling approaches, such as software pipelining, that can achieve the benefits of unrolling without as much code expansion.
To combat this code size increase, clever encodings are sometimes used. For example, there may be only one large immediate field for use by any functional unit. Another technique is to compress the instructions in main memory and expand them when they are read into the cache or are decoded. In Appendix H, we show other techniques, as well as document the significant code expansion seen on IA-64.
Early VLIWs operated in lockstep; there was no hazard-detection hardware at all. This struc-
ture dictated that a stall in any functional unit pipeline must cause the entire processor to stall,
since all the functional units must be kept synchronized. Although a compiler may be able to
schedule the deterministic functional units to prevent stalls, predicting which data accesses
will encounter a cache stall and scheduling them are very difficult. Hence, caches needed to be
 