4.5 Detecting and Enhancing Loop-Level Parallelism
Loops in programs are the fountainhead of many of the types of parallelism we discussed above and in Chapter 5. In this section, we discuss compiler technology for discovering the amount of parallelism that we can exploit in a program as well as hardware support for these compiler techniques. We define precisely when a loop is parallel (or vectorizable), how dependence can prevent a loop from being parallel, and techniques for eliminating some types of dependences. Finding and manipulating loop-level parallelism is critical to exploiting both DLP and TLP, as well as the more aggressive static ILP approaches (e.g., VLIW) that we examine in Appendix H.
Loop-level parallelism is normally analyzed at the source level or close to it, while most analysis of ILP is done once instructions have been generated by the compiler. Loop-level analysis involves determining what dependences exist among the operands in a loop across the iterations of that loop. For now, we will consider only data dependences, which arise when an operand is written at some point and read at a later point. Name dependences also exist and may be removed by the renaming techniques discussed in Chapter 3.
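As a quick illustration (ours, not an example from Chapter 3), a scalar temporary reused across iterations creates only name dependences, which renaming removes; the arrays a, b, and c and the temporary t below are hypothetical:

/* Name dependences on t: iteration i+1 must not write t before
   iteration i has read it (an antidependence), and the writes to t
   in successive iterations form an output dependence. No data
   flows between iterations. */
for (i = 0; i < 100; i = i + 1) {
    t = a[i] + b[i];
    c[i] = 2.0 * t;
}

/* Renaming t to a fresh, per-iteration local removes the name
   dependences, leaving each iteration independent. */
for (i = 0; i < 100; i = i + 1) {
    double t1 = a[i] + b[i];
    c[i] = 2.0 * t1;
}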
The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations; such a dependence is called a loop-carried dependence. Most of the examples we considered in Chapters 2 and 3 had no loop-carried dependences and, thus, are loop-level parallel. To see that a loop is parallel, let us first look at the source representation:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
In this loop, the two uses of x[i] are dependent, but this dependence is within a single iteration and is not loop carried. There is a loop-carried dependence between successive uses of i in different iterations, but this dependence involves an induction variable that can be easily recognized and eliminated. We saw examples of how to eliminate dependences involving induction variables during loop unrolling in Section 2.2 of Chapter 2, and we will look at additional examples later in this section.
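As a sketch of that transformation (our illustration, in the spirit of Section 2.2), unrolling the loop above by a factor of four replaces the per-iteration induction-variable updates with explicit offsets, leaving four independent statements in each iteration; we assume the trip count of 1000 is divisible by four:

for (i = 999; i >= 0; i = i - 4) {
    x[i]   = x[i]   + s;  /* the four statements have no   */
    x[i-1] = x[i-1] + s;  /* dependences among themselves, */
    x[i-2] = x[i-2] + s;  /* so they may be scheduled or   */
    x[i-3] = x[i-3] + s;  /* executed in parallel          */
}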
Because finding loop-level parallelism involves recognizing structures such as loops, array
references, and induction variable computations, the compiler can do this analysis more easily
at or near the source level, as opposed to the machine-code level. Let's look at a more complex
example.
Example
Consider a loop like this one:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];   /* S1 */
    B[i+1] = B[i] + A[i+1]; /* S2 */
}
Assume that A, B, and C are distinct, nonoverlapping arrays. (In practice, the arrays may sometimes be the same or may overlap. Because the arrays may be passed as parameters to a procedure that includes this loop, determining whether arrays overlap or are identical often requires sophisticated, interprocedural analysis of the program.) What are the data dependences among the statements S1 and S2 in the loop?
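To make the overlap concern concrete, here is a hedged sketch (ours; the procedure compute, the caller, and the buffer are hypothetical) of a call site where the three parameter arrays alias one another, which is exactly what interprocedural analysis must rule out:

void compute(double *A, double *B, double *C) {
    for (int i = 0; i < 100; i = i + 1) {
        A[i+1] = A[i] + C[i];   /* S1 */
        B[i+1] = B[i] + A[i+1]; /* S2 */
    }
}

void caller(void) {
    double buf[201];
    /* The three "arrays" are slices of one buffer: B[0] is the same
       element as A[50], so the nonoverlap assumption fails, and the
       compiler cannot prove independence without seeing this call. */
    compute(&buf[0], &buf[50], &buf[100]);
}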