FIGURE 4.4 Using multiple functional units to improve the performance of a single
vector add instruction, C = A + B. The vector processor (a) on the left has a single add
pipeline and can complete one addition per cycle. The vector processor (b) on the right has
four add pipelines and can complete four additions per cycle. The elements within a single
vector add instruction are interleaved across the four pipelines. The set of elements that move
through the pipelines together is termed an element group. (Reproduced with permission from
Asanovic [1998].)
The VMIPS instruction set has the property that all vector arithmetic instructions allow
element N of one vector register to take part in operations only with element N from other
vector registers. This dramatically simplifies the construction of a highly parallel vector unit,
which can be structured as multiple parallel lanes. As with a traffic highway, we can increase
the peak throughput of a vector unit by adding more lanes. Figure 4.5 shows the structure of a
four-lane vector unit. Going from one lane to four lanes thus reduces the number of clocks for
a chime from 64 to 16. For multiple lanes to be advantageous, both the applications and the
architecture must support long vectors; otherwise, the vector instructions will execute so
quickly that you'll run out of instruction bandwidth, requiring ILP techniques (see Chapter 3)
to supply enough vector instructions.
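
The lane arithmetic above can be illustrated with a small sketch. It is not a model of VMIPS hardware: it assumes an idealized vector unit in which each lane completes one element per clock and start-up latency is ignored, and the names lane_of, element_groups, chime_clocks, and lane_vector_add are hypothetical helpers introduced only for this illustration.

```python
from math import ceil


def lane_of(element_index: int, num_lanes: int) -> int:
    """Lane that handles a given element when elements are interleaved across lanes."""
    return element_index % num_lanes


def element_groups(vector_length: int, num_lanes: int) -> list[list[int]]:
    """Split element indices into groups that move through the lanes together."""
    return [
        list(range(start, min(start + num_lanes, vector_length)))
        for start in range(0, vector_length, num_lanes)
    ]


def chime_clocks(vector_length: int, num_lanes: int) -> int:
    """Idealized clocks per chime: one element group per clock, no start-up cost."""
    return ceil(vector_length / num_lanes)


def lane_vector_add(a: list[float], b: list[float], num_lanes: int) -> list[float]:
    """C = A + B, processed one element group at a time.

    Element N of A is only ever combined with element N of B, so no data
    crosses between lanes. The inner loop stands in for work that the
    hardware lanes would perform in the same clock.
    """
    if len(a) != len(b):
        raise ValueError("vector operands must have the same length")
    c = [0.0] * len(a)
    for group in element_groups(len(a), num_lanes):
        for i in group:
            c[i] = a[i] + b[i]
    return c


if __name__ == "__main__":
    for lanes in (1, 4):
        print(lanes, "lane(s):", chime_clocks(64, lanes), "clocks per chime")
```

Under these assumptions, a 64-element vector takes 64 clocks with one lane and 16 clocks with four lanes, matching the figures in the text. Because element i always stays in lane i mod num_lanes, each lane needs access only to its own portion of the vector registers.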
 