Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach

Hardware Reference

In-Depth Information

FIGURE 3.2 Latencies of FP operations used in this chapter . The last column is the num-

ber of intervening clock cycles needed to avoid a stall. These numbers are similar to the aver-

age latencies we would see on an FP unit. The latency of a floating-point load to a store is 0,

since the result of the load can be bypassed without stalling the store. We will continue to as-

sume an integer load latency of 1 and an integer ALU operation latency of 0.

In this subsection, we look at how the compiler can increase the amount of available ILP by

transforming loops. This example serves both to illustrate an important technique as well as to

motivate the more powerful program transformations described in Appendix H. We will rely

on the following code segment, which adds a scalar to a vector:

for (i=999; i>=0; i=i−1)

x[i] = x[i] + s;

We can see that this loop is parallel by noticing that the body of each iteration is independ-

ent. We formalize this notion in Appendix H and describe how we can test whether loop iter-

ations are independent at compile time. First, let's look at the performance of this loop, show-

ing how we can use the parallelism to improve its performance for a MIPS pipeline with the

latencies shown above.

The first step is to translate the above segment to MIPS assembly language. In the following

code segment, R1 is initially the address of the element in the array with the highest address,

and F2 contains the scalar value s . Register R2 is precomputed, so that 8(R2) is the address of the

last element to operate on.

The straightforward MIPS code, not scheduled for the pipeline, looks like this:

Loop: L.D F0,0(R1) ;F0=array element

ADD.D F4,F0,F2 ;add scalar in F2

S.D F4,0(R1) ;store result

DADDUI R1,R1,#-8 ;decrement pointer

;8 bytes (per DW)

BNE R1,R2,Loop ;branch R1!=R2

Let's start by seeing how well this loop will run when it is scheduled on a simple pipeline

for MIPS with the latencies from Figure 3.2 .

Example

Show how the loop would look on MIPS, both scheduled and unscheduled, in-

cluding any stalls or idle clock cycles. Schedule for delays from loating-point

operations, but remember that we are ignoring delayed branches.

Answer

Without any scheduling, the loop will execute as follows, taking nine cycles:

Search WWH ::

Custom Search

Home