Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach


         * (tiPR[AA]*clR[A] + tiPR[AC]*clR[C] + tiPR[AG]*clR[G] + tiPR[AT]*clR[T]);
clP[h++] = (tiPL[CA]*clL[A] + tiPL[CC]*clL[C] + tiPL[CG]*clL[G] + tiPL[CT]*clL[T])
         * (tiPR[CA]*clR[A] + tiPR[CC]*clR[C] + tiPR[CG]*clR[G] + tiPR[CT]*clR[T]);
clP[h++] = (tiPL[GA]*clL[A] + tiPL[GC]*clL[C] + tiPL[GG]*clL[G] + tiPL[GT]*clL[T])
         * (tiPR[GA]*clR[A] + tiPR[GC]*clR[C] + tiPR[GG]*clR[G] + tiPR[GT]*clR[T]);
clP[h++] = (tiPL[TA]*clL[A] + tiPL[TC]*clL[C] + tiPL[TG]*clL[G] + tiPL[TT]*clL[T])
         * (tiPR[TA]*clR[A] + tiPR[TC]*clR[C] + tiPR[TG]*clR[G] + tiPR[TT]*clR[T]);
clL += 4;
clR += 4;
tiPL += 16;
tiPR += 16;
}
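The unrolled statements above follow one pattern: for each parent state, a four-term dot product against the left child is multiplied by a four-term dot product against the right child. A minimal C sketch of that pattern follows. It assumes that the state constants A, C, G, T are 0 through 3 and that each pair constant XY equals 4*X + Y (the values in Figure 4.32 are not reproduced here), and that the enclosing loop runs once per site up to seq_length. The function name cond_like_down is hypothetical.

```c
#include <stddef.h>

/* Assumed encoding: states A, C, G, T are 0..3 and the pair constant XY
   is 4*X + Y. These values are an assumption, not taken from Figure 4.32. */
enum { NUM_STATES = 4 };

/* Hypothetical rolled-up form of the kernel: for every site, each of the
   four parent entries is (row of tiPL . clL) * (row of tiPR . clR). */
void cond_like_down(const float *tiPL, const float *tiPR,
                    const float *clL, const float *clR,
                    float *clP, size_t seq_length)
{
    size_t h = 0;
    for (size_t k = 0; k < seq_length; k++) {
        for (int i = 0; i < NUM_STATES; i++) {
            float left = 0.0f;
            float right = 0.0f;
            for (int j = 0; j < NUM_STATES; j++) {
                left  += tiPL[NUM_STATES * i + j] * clL[j];
                right += tiPR[NUM_STATES * i + j] * clR[j];
            }
            clP[h++] = left * right;
        }
        /* Advance to the next site, as in the original pointer updates. */
        clL  += NUM_STATES;
        clR  += NUM_STATES;
        tiPL += NUM_STATES * NUM_STATES;
        tiPR += NUM_STATES * NUM_STATES;
    }
}
```

Each site costs 32 multiplies and 24 adds for the dot products plus 4 multiplies for the products, which is a useful number to keep in mind when counting FLOPs.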

4.1 [25] <4.2, 4.3> Assume the constants shown in Figure 4.32. Show the code for MIPS and VMIPS. Assume we cannot use scatter-gather loads or stores. Assume the starting addresses of tiPL, tiPR, clL, clR, and clP are in RtiPL, RtiPR, RclL, RclR, and RclP, respectively. Assume the VMIPS register length is user programmable and can be assigned by setting the special register VL (e.g., li VL 4). To facilitate vector addition reductions, assume that we add the following instruction to VMIPS:

SUMR.S Fd, Vs Vector Summation Reduction Single Precision:

This instruction performs a summation reduction on a vector register Vs, writing the sum into scalar register Fd.
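To make the reduction concrete, here is a minimal C sketch of what SUMR.S computes, under the assumption that the elements are added in index order (the exercise does not specify the hardware's summation order). The names sumr_s and dot4 are hypothetical, and dot4 only illustrates how an elementwise multiply followed by the reduction yields one of the kernel's four-term sums.

```c
#include <stddef.h>

/* Hypothetical model of SUMR.S Fd, Vs: add up the first vl elements of the
   vector register vs and return the scalar result. Sequential summation
   order is an assumption. */
float sumr_s(const float *vs, size_t vl)
{
    float fd = 0.0f;
    for (size_t i = 0; i < vl; i++) {
        fd += vs[i];
    }
    return fd;
}

/* One four-term sum from the kernel: elementwise multiply (the role a
   vector-vector multiply would play with VL = 4), then reduce. */
float dot4(const float *ti_row, const float *cl)
{
    float v[4];
    for (size_t i = 0; i < 4; i++) {
        v[i] = ti_row[i] * cl[i];
    }
    return sumr_s(v, 4);
}
```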

FIGURE 4.32 Constants and values for the case study.

4.2 [5] <4.2, 4.3> Assuming seq_length == 500, what is the dynamic instruction count for both implementations?

4.3 [25] <4.2, 4.3> Assume that the vector reduction instruction is executed on the vector functional unit, similar to a vector add instruction. Show how the code sequence lays out in convoys assuming a single instance of each vector functional unit. How many chimes will the code require? How many cycles per FLOP are needed, ignoring vector instruction issue overhead?

4.4 [15] <4.2, 4.3> Now assume that we can use scatter-gather loads and stores (LVI and SVI). Assume that tiPL, tiPR, clL, clR, and clP are arranged consecutively in memory. For example, if seq_length==500, the tiPR array would begin 500 * 4 bytes after the tiPL array. How does this affect the way you can write the VMIPS code for this kernel? Assume that you can initialize vector registers with integers using the following technique, which would, for example, initialize vector register V1 with values (0,0,2000,2000):
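Separately from that VMIPS technique, the effect of such an index vector can be sketched in C. The sketch below models an indexed (gather) load in which each index is a byte offset from a base address. The function gather_f32 is hypothetical, and the byte-offset interpretation and 4-byte float size are assumptions chosen to match the 500 * 4 = 2000 example in the exercise.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of an indexed load: element i of the destination is
   the float found idx[i] bytes past base. memcpy avoids alignment and
   aliasing pitfalls when reading through a byte pointer. */
void gather_f32(float *dst, const void *base, const size_t *idx, size_t vl)
{
    const unsigned char *p = (const unsigned char *)base;
    for (size_t i = 0; i < vl; i++) {
        memcpy(&dst[i], p + idx[i], sizeof(float));
    }
}
```

With indices (0, 0, 2000, 2000) and a base pointing at tiPL, the first two destination elements come from the start of tiPL and the last two from the start of tiPR, so a single gather can fetch operands from both arrays at once.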
