Hardware Reference
In-Depth Information
FIGURE 3.2 Latencies of FP operations used in this chapter . The last column is the num-
ber of intervening clock cycles needed to avoid a stall. These numbers are similar to the aver-
age latencies we would see on an FP unit. The latency of a floating-point load to a store is 0,
since the result of the load can be bypassed without stalling the store. We will continue to as-
sume an integer load latency of 1 and an integer ALU operation latency of 0.
In this subsection, we look at how the compiler can increase the amount of available ILP by
transforming loops. This example serves both to illustrate an important technique as well as to
motivate the more powerful program transformations described in Appendix H. We will rely
on the following code segment, which adds a scalar to a vector:
for (i=999; i>=0; i=i−1)
x[i] = x[i] + s;
We can see that this loop is parallel by noticing that the body of each iteration is independ-
ent. We formalize this notion in Appendix H and describe how we can test whether loop iter-
ations are independent at compile time. First, let's look at the performance of this loop, show-
ing how we can use the parallelism to improve its performance for a MIPS pipeline with the
latencies shown above.
The first step is to translate the above segment to MIPS assembly language. In the following
code segment, R1 is initially the address of the element in the array with the highest address,
and F2 contains the scalar value s . Register R2 is precomputed, so that 8(R2) is the address of the
last element to operate on.
The straightforward MIPS code, not scheduled for the pipeline, looks like this:
Loop: L.D F0,0(R1) ;F0=array element
ADD.D F4,F0,F2 ;add scalar in F2
S.D F4,0(R1) ;store result
DADDUI R1,R1,#-8 ;decrement pointer
;8 bytes (per DW)
BNE R1,R2,Loop ;branch R1!=R2
Let's start by seeing how well this loop will run when it is scheduled on a simple pipeline
for MIPS with the latencies from Figure 3.2 .
Example
Show how the loop would look on MIPS, both scheduled and unscheduled, in-
cluding any stalls or idle clock cycles. Schedule for delays from loating-point
operations, but remember that we are ignoring delayed branches.
Answer
Without any scheduling, the loop will execute as follows, taking nine cycles:
 
Search WWH ::




Custom Search