Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach

Hardware Reference

In-Depth Information

for the 6-clock-cycle bank busy time. The total time will be 12 + 1 + 6 * 63 = 391

clock cycles, or 6.1 clock cycles per element.

Gather-Scatter: Handling Sparse Matrices In Vector

Architectures

As mentioned above, sparse matrices are commonplace so it is important to have techniques

to allow programs with sparse matrices to execute in vector mode. In a sparse matrix, the ele-

ments of a vector are usually stored in some compacted form and then accessed indirectly.

Assuming a simplified sparse structure, we might see code that looks like this:

for (i = 0; i < n; i=i+1)

A[K[i]] = A[K[i]] + C[M[i]];

This code implements a sparse vector sum on the arrays A and C , using index vectors K and M

to designate the nonzero elements of A and C . ( A and C must have the same number of nonzero

elements— n of them—so K and M are the same size.)

The primary mechanism for supporting sparse matrices is gather-scater operations using in-

dex vectors. The goal of such operations is to support moving between a compressed repres-

entation (i.e., zeros are not included) and normal representation (i.e., the zeros are included) of

a sparse matrix. A gather operation takes an index vector and fetches the vector whose elements

are at the addresses given by adding a base address to the offsets given in the index vector.

The result is a dense vector in a vector register. After these elements are operated on in dense

form, the sparse vector can be stored in expanded form by a scater store, using the same index

vector. Hardware support for such operations is called gather-scatter and it appears on nearly

all modern vector processors. The VMIPS instructions are LVI (load vector indexed or gather)

and SVI (store vector indexed or scater). For example, if Ra , Rc , Rk , and Rm contain the starting

addresses of the vectors in the previous sequence, we can code the inner loop with vector in-

structions such as:

LV Vk, Rk ;load K

LVI Va, (Ra+Vk) ;load A[K[]]

LV Vm, Rm ;load M

LVI Vc, (Rc+Vm) ;load C[M[]]

ADDVV.D Va, Va, Vc ;add them

SVI (Ra+Vk), Va ;store A[K[]]

This technique allows code with sparse matrices to run in vector mode. A simple vectorizing

compiler could not automatically vectorize the source code above because the compiler would

not know that the elements of K are distinct values, and thus that no dependences exist. In-

stead, a programmer directive would tell the compiler that it was safe to run the loop in vector

mode.

Although indexed loads and stores (gather and scater) can be pipelined, they typically run

much more slowly than non-indexed loads or stores, since the memory banks are not known

at the start of the instruction. Each element has an individual address, so they can't be handled

in groups, and there can be conflicts at many places throughout the memory system. Thus,

each individual access incurs significant latency. However, as Section 4.7 shows, a memory

system can deliver beter performance by designing for this case and by using more hardware

resources versus when architects have a laissez faire atitude toward such accesses.

Computer Architecture: A Quantitative Approach

Search WWH ::

Custom Search

Home