    X1[i] = X[i] + c; /* X renamed to X1 to remove antidependence */
    Z[i] = T[i] + c;  /* Y renamed to T to remove antidependence */
    Y[i] = c - T[i];
}
After the loop, the variable X has been renamed X1. In code that follows the
loop, the compiler can simply replace the name X by X1. In this case, renaming
does not require an actual copy operation, as it can be done by substituting
names or by register allocation. In other cases, however, renaming will require
copying.
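As a compilable sketch of the renamed code, assuming a loop header, first statement, and initial values that fall before this excerpt:

#include <stdio.h>

#define N 100

int main(void) {
    double X[N], X1[N], Y[N], Z[N], T[N];
    double c = 2.0;
    for (int i = 0; i < N; i++)
        X[i] = i;                 /* assumed initial values */

    for (int i = 0; i < N; i++) {
        T[i]  = X[i] / c;         /* Y renamed to T to remove output dependence */
        X1[i] = X[i] + c;         /* X renamed to X1 to remove antidependence */
        Z[i]  = T[i] + c;         /* Y renamed to T to remove antidependence */
        Y[i]  = c - T[i];
    }

    /* Code after the loop refers to X1 wherever it previously referred to X;
       the rename is a substitution of names, not a copy of the array. */
    printf("%f %f %f\n", X1[N - 1], Y[N - 1], Z[N - 1]);
    return 0;
}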
Dependence analysis is a critical technology for exploiting parallelism, as well as for the
transformations like blocking that Chapter 2 covers. For detecting loop-level parallelism,
dependence analysis is the basic tool. Effectively compiling programs for vector computers,
SIMD computers, or multiprocessors depends critically on this analysis. The major drawback
of dependence analysis is that it applies only under a limited set of circumstances, namely,
among references within a single loop nest and using affine index functions. Thus, there are
many situations where array-oriented dependence analysis cannot tell us what we want to
know; for example, analyzing accesses done with pointers, rather than with array indices, can
be much harder. (This is one reason why FORTRAN is still preferred over C and C++ for many
scientific applications designed for parallel computers.) Similarly, analyzing references across
procedure calls is extremely difficult. Thus, while analysis of code written in sequential
languages remains important, we also need approaches such as OpenMP and CUDA that write
explicitly parallel loops.
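For instance, a minimal sketch of an explicitly parallel loop written with OpenMP (the function, array names, and size here are illustrative, not taken from the text):

#define N 10000

/* Each iteration writes only its own element, so the programmer asserts the
   independence of iterations directly rather than relying on the compiler's
   dependence analysis; compile with an OpenMP-capable compiler (e.g., -fopenmp). */
void add_scalar(double c, const double X[N], double Y[N]) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        Y[i] = X[i] + c;
}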
Eliminating Dependent Computations
As mentioned above, one of the most important forms of dependent computations is a
recurrence. A dot product is a perfect example of a recurrence:
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
This loop is not parallel because it has a loop-carried dependence on the variable sum. We
can, however, transform it into a set of loops, one of which is completely parallel and the
other of which can be partly parallel. The first loop will execute the completely parallel
portion of this loop. It looks like:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
Notice that sum has been expanded from a scalar into a vector quantity (a transformation
called scalar expansion) and that this transformation makes this new loop completely parallel.
When we are done, however, we need to do the reduce step, which sums up the elements of
the vector. It looks like:
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
Although this loop is not parallel, it has a very specific structure called a reduction.
Reductions are common in linear algebra and, as we shall see in Chapter 6, they are also a key
part of the primary parallelism primitive MapReduce used in warehouse-scale computers. In
general, any function can be used as a reduction operator, and common cases include operators
such as max and min.
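As a sketch of how such a reduction is typically written in an explicitly parallel style (the OpenMP usage here is illustrative, not from the text), each thread can accumulate a private partial sum that is combined at the end:

/* Compile with an OpenMP-capable compiler (e.g., -fopenmp). */
double dot(const double *x, const double *y, int n) {
    double finalsum = 0.0;
    /* reduction(+:finalsum) gives each thread a private copy of finalsum and
       adds the per-thread partial sums together when the loop completes. */
    #pragma omp parallel for reduction(+:finalsum)
    for (int i = 0; i < n; i++)
        finalsum = finalsum + x[i] * y[i];
    return finalsum;
}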