Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach


Vector-Length Registers: Handling Loops Not Equal To 64

A vector register processor has a natural vector length determined by the number of elements in each vector register. This length, which is 64 for VMIPS, is unlikely to match the real vector length in a program. Moreover, in a real program the length of a particular vector operation is often unknown at compile time. In fact, a single piece of code may require different vector lengths. For example, consider this code:

for (i = 0; i < n; i = i + 1)
    Y[i] = a * X[i] + Y[i];

The size of all the vector operations depends on n, which may not even be known until run time! The value of n might also be a parameter to a procedure containing the above loop and is therefore subject to change during execution.

The solution to these problems is to create a vector-length register (VLR). The VLR controls the length of any vector operation, including a vector load or store. The value in the VLR, however, cannot be greater than the length of the vector registers. This solves our problem as long as the real length is less than or equal to the maximum vector length (MVL). The MVL is the number of data elements that a vector register holds in a given implementation of the architecture. Because the MVL is a parameter, the length of vector registers can grow in later computer generations without changing the instruction set; as we shall see in the next section, multimedia SIMD extensions have no equivalent of MVL, so they change the instruction set every time they increase their vector length.
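To make the role of the VLR concrete, here is a minimal software model in C. It is only a sketch: the names vreg_t, set_vlr, vload, vstore, and vaxpy are hypothetical and are not VMIPS instructions, and MVL is fixed at 64 to match the VMIPS figure quoted above. Each modeled operation touches only the first vlr elements of a register, and set_vlr never returns more than MVL.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* maximum vector length; 64 matches VMIPS in the text */

/* Hypothetical model of one vector register holding MVL doubles. */
typedef struct { double elem[MVL]; } vreg_t;

/* Hypothetical helper: pick a VLR value for n pending elements.
   The result is clamped so that it never exceeds MVL. */
size_t set_vlr(size_t n) {
    return n < MVL ? n : MVL;
}

/* Modeled vector load: only the first vlr elements are written. */
void vload(vreg_t *v, const double *mem, size_t vlr) {
    assert(vlr <= MVL);
    for (size_t i = 0; i < vlr; i++)
        v->elem[i] = mem[i];
}

/* Modeled vector store: only the first vlr elements are read. */
void vstore(double *mem, const vreg_t *v, size_t vlr) {
    assert(vlr <= MVL);
    for (size_t i = 0; i < vlr; i++)
        mem[i] = v->elem[i];
}

/* Modeled vector operation y = a*x + y over the first vlr elements. */
void vaxpy(vreg_t *y, double a, const vreg_t *x, size_t vlr) {
    assert(vlr <= MVL);
    for (size_t i = 0; i < vlr; i++)
        y->elem[i] = a * x->elem[i] + y->elem[i];
}
```

In this model, a 10-element vector is handled by setting the length to set_vlr(10) and passing that value to each operation; elements 10 through 63 of the registers are left untouched. A request for 200 elements is clamped to 64, which is exactly the case the VLR alone cannot handle.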

What if the value of n is not known at compile time and thus may be greater than the MVL? To tackle this second problem, where the vector is longer than the maximum length, a technique called strip mining is used. Strip mining is the generation of code such that each vector operation is done for a size less than or equal to the MVL. We create one loop that handles any number of iterations that is a multiple of the MVL and another loop that handles the remaining iterations, which must number fewer than the MVL. In practice, compilers usually create a single strip-mined loop that is parameterized to handle both portions by changing the length. We show the strip-mined version of the DAXPY loop in C:

low = 0;
VL = (n % MVL);                              /* find odd-size piece using modulo op % */
for (j = 0; j <= (n/MVL); j = j + 1) {       /* outer loop */
    for (i = low; i < (low + VL); i = i + 1) /* runs for length VL */
        Y[i] = a * X[i] + Y[i];              /* main operation */
    low = low + VL;                          /* start of next vector */
    VL = MVL;                                /* reset the length to maximum vector length */
}

The term n/MVL represents truncating integer division. The effect of this loop is to block the vector into segments that are then processed by the inner loop. The length of the first segment is (n % MVL), and all subsequent segments are of length MVL. Figure 4.6 shows how to split the long vector into segments.
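The segmentation can be checked with a small self-contained C sketch that wraps the strip-mined loop in a function and compares it against a plain scalar loop. The function names daxpy_scalar and daxpy_strip_mined, the mvl parameter, and the returned count of non-empty segments are assumptions added for illustration; the loop structure itself follows the code above. In real vector hardware the inner loop would be a single sequence of vector instructions executed with the VLR set to vl.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: Y = a*X + Y over n elements. */
void daxpy_scalar(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Strip-mined DAXPY. The first segment has n % mvl elements (possibly
   zero); every later segment has exactly mvl elements. Returns the
   number of non-empty segments processed (an illustrative extra). */
size_t daxpy_strip_mined(size_t n, size_t mvl, double a,
                         const double *x, double *y) {
    assert(mvl > 0);
    size_t low = 0;          /* start index of the current segment */
    size_t vl = n % mvl;     /* odd-size piece comes first */
    size_t segments = 0;

    for (size_t j = 0; j <= n / mvl; j++) {      /* outer loop */
        for (size_t i = low; i < low + vl; i++)  /* runs for length vl */
            y[i] = a * x[i] + y[i];
        if (vl > 0)
            segments++;
        low += vl;           /* start of next segment */
        vl = mvl;            /* all remaining segments are full length */
    }
    return segments;
}
```

For n = 200 and an MVL of 64, the first segment has 200 % 64 = 8 elements and is followed by three full segments of 64, so the function returns 4. When n is an exact multiple of the MVL, the first pass of the outer loop has a length of zero and does no work.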
