Hardware Reference
In-Depth Information
for the 6-clock-cycle bank busy time. The total time will be 12 + 1 + 6 * 63 = 391
clock cycles, or 6.1 clock cycles per element.
Gather-Scatter: Handling Sparse Matrices In Vector
Architectures
As mentioned above, sparse matrices are commonplace so it is important to have techniques
to allow programs with sparse matrices to execute in vector mode. In a sparse matrix, the ele-
ments of a vector are usually stored in some compacted form and then accessed indirectly.
Assuming a simplified sparse structure, we might see code that looks like this:
for (i = 0; i < n; i=i+1)
A[K[i]] = A[K[i]] + C[M[i]];
This code implements a sparse vector sum on the arrays A and C , using index vectors K and M
to designate the nonzero elements of A and C . ( A and C must have the same number of nonzero
elements— n of them—so K and M are the same size.)
The primary mechanism for supporting sparse matrices is gather-scater operations using in-
dex vectors. The goal of such operations is to support moving between a compressed repres-
entation (i.e., zeros are not included) and normal representation (i.e., the zeros are included) of
a sparse matrix. A gather operation takes an index vector and fetches the vector whose elements
are at the addresses given by adding a base address to the offsets given in the index vector.
The result is a dense vector in a vector register. After these elements are operated on in dense
form, the sparse vector can be stored in expanded form by a scater store, using the same index
vector. Hardware support for such operations is called gather-scatter and it appears on nearly
all modern vector processors. The VMIPS instructions are LVI (load vector indexed or gather)
and SVI (store vector indexed or scater). For example, if Ra , Rc , Rk , and Rm contain the starting
addresses of the vectors in the previous sequence, we can code the inner loop with vector in-
structions such as:
LV Vk, Rk ;load K
LVI Va, (Ra+Vk) ;load A[K[]]
LV Vm, Rm ;load M
LVI Vc, (Rc+Vm) ;load C[M[]]
ADDVV.D Va, Va, Vc ;add them
SVI (Ra+Vk), Va ;store A[K[]]
This technique allows code with sparse matrices to run in vector mode. A simple vectorizing
compiler could not automatically vectorize the source code above because the compiler would
not know that the elements of K are distinct values, and thus that no dependences exist. In-
stead, a programmer directive would tell the compiler that it was safe to run the loop in vector
mode.
Although indexed loads and stores (gather and scater) can be pipelined, they typically run
much more slowly than non-indexed loads or stores, since the memory banks are not known
at the start of the instruction. Each element has an individual address, so they can't be handled
in groups, and there can be conflicts at many places throughout the memory system. Thus,
each individual access incurs significant latency. However, as Section 4.7 shows, a memory
system can deliver beter performance by designing for this case and by using more hardware
resources versus when architects have a laissez faire atitude toward such accesses.
Search WWH ::




Custom Search