    X1[i] = X[i] + c; /* X renamed to X1 to remove antidependence */
    Z[i] = T[i] + c;  /* Y renamed to T to remove antidependence */
    Y[i] = c - T[i];
}
After the loop, the variable X has been renamed X1. In code that follows the
loop, the compiler can simply replace the name X by X1. In this case, renaming
does not require an actual copy operation, as it can be done by substituting
names or by register allocation. In other cases, however, renaming will require
copying.
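As a compilable sketch of the renamed code, assuming a loop header, first statement, and initial values that fall before this excerpt:

#include <stdio.h>

#define N 100

int main(void) {
    double X[N], X1[N], Y[N], Z[N], T[N];
    double c = 2.0;
    for (int i = 0; i < N; i++)
        X[i] = i;                 /* assumed initial values */

    for (int i = 0; i < N; i++) {
        T[i]  = X[i] / c;         /* Y renamed to T to remove output dependence */
        X1[i] = X[i] + c;         /* X renamed to X1 to remove antidependence */
        Z[i]  = T[i] + c;         /* Y renamed to T to remove antidependence */
        Y[i]  = c - T[i];
    }

    /* Code after the loop refers to X1 wherever it previously referred to X;
       the rename is a substitution of names, not a copy of the array. */
    printf("%f %f %f\n", X1[N - 1], Y[N - 1], Z[N - 1]);
    return 0;
}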
Dependence analysis is a critical technology for exploiting parallelism, as well as for the
transformations like blocking that Chapter 2 covers. For detecting loop-level parallelism,
dependence analysis is the basic tool. Effectively compiling programs for vector computers,
SIMD computers, or multiprocessors depends critically on this analysis. The major drawback
of dependence analysis is that it applies only under a limited set of circumstances, namely,
among references within a single loop nest and using affine index functions. Thus, there are
many situations where array-oriented dependence analysis cannot tell us what we want to
know; for example, analyzing accesses done with pointers, rather than with array indices, can
be much harder. (This is one reason why FORTRAN is still preferred over C and C++ for many
scientific applications designed for parallel computers.) Similarly, analyzing references across
procedure calls is extremely difficult. Thus, while analysis of code written in sequential
languages remains important, we also need approaches such as OpenMP and CUDA that write
explicitly parallel loops.
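For instance, a minimal sketch of an explicitly parallel loop written with OpenMP (the function, array names, and size here are illustrative, not taken from the text):

#define N 10000

/* Each iteration writes only its own element, so the programmer asserts the
   independence of iterations directly rather than relying on the compiler's
   dependence analysis; compile with an OpenMP-capable compiler (e.g., -fopenmp). */
void add_scalar(double c, const double X[N], double Y[N]) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        Y[i] = X[i] + c;
}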
Eliminating Dependent Computations
As mentioned above, one of the most important forms of dependent computations is a
recurrence. A dot product is a perfect example of a recurrence:
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
This loop is not parallel because it has a loop-carried dependence on the variable sum. We
can, however, transform it into a set of loops, one of which is completely parallel and the
other of which can be partly parallel. The first loop will execute the completely parallel
portion of this loop. It looks like:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
Notice that sum has been expanded from a scalar into a vector quantity (a transformation
called scalar expansion) and that this transformation makes this new loop completely parallel.
When we are done, however, we need to do the reduce step, which sums up the elements of
the vector. It looks like:
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
Although this loop is not parallel, it has a very specific structure called a reduction.
Reductions are common in linear algebra and, as we shall see in Chapter 6, they are also a key
part of the primary parallelism primitive MapReduce used in warehouse-scale computers. In
general, any function can be used as a reduction operator, and common cases include operators
such as max and min.
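As a sketch of how such a reduction is typically written in an explicitly parallel style (the OpenMP usage here is illustrative, not from the text), each thread can accumulate a private partial sum that is combined at the end:

/* Compile with an OpenMP-capable compiler (e.g., -fopenmp). */
double dot(const double *x, const double *y, int n) {
    double finalsum = 0.0;
    /* reduction(+:finalsum) gives each thread a private copy of finalsum and
       adds the per-thread partial sums together when the loop completes. */
    #pragma omp parallel for reduction(+:finalsum)
    for (int i = 0; i < n; i++)
        finalsum = finalsum + x[i] * y[i];
    return finalsum;
}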