Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach


Example

To give an idea of what multimedia instructions look like, assume we added 256-bit SIMD multimedia instructions to MIPS. We concentrate on floating-point in this example. We add the suffix “4D” on instructions that operate on four double-precision operands at once. Like vector architectures, you can think of a SIMD processor as having lanes, four in this case. MIPS SIMD will reuse the floating-point registers as operands for 4D instructions, just as double-precision reused single-precision registers in the original MIPS. This example shows MIPS SIMD code for the DAXPY loop (Y = a × X + Y). Assume that the starting addresses of X and Y are in Rx and Ry, respectively. Underline the changes to the MIPS code for SIMD.

Answer

Here is the MIPS code:

      L.D     F0,a         ;load scalar a
      MOV     F1,F0        ;copy a into F1 for SIMD MUL
      MOV     F2,F0        ;copy a into F2 for SIMD MUL
      MOV     F3,F0        ;copy a into F3 for SIMD MUL
      DADDIU  R4,Rx,#512   ;last address to load
Loop: L.4D    F4,0(Rx)     ;load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D  F4,F4,F0     ;a × X[i], a × X[i+1], a × X[i+2], a × X[i+3]
      L.4D    F8,0(Ry)     ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D  F8,F8,F4     ;a × X[i]+Y[i], …, a × X[i+3]+Y[i+3]
      S.4D    F8,0(Ry)     ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU  Rx,Rx,#32    ;increment index to X
      DADDIU  Ry,Ry,#32    ;increment index to Y
      DSUBU   R20,R4,Rx    ;compute bound
      BNEZ    R20,Loop     ;check if done

The changes were replacing every MIPS double-precision instruction with its 4D equivalent, increasing the increment from 8 to 32, and changing the registers from F2 and F4 to F4 and F8 to get enough space in the register file for four sequential double-precision operands. So that each SIMD lane would have its own copy of the scalar a, we copied the value of F0 into registers F1, F2, and F3. (Real SIMD instruction extensions have an instruction to broadcast a value to all other registers in a group.) Thus, the multiply does F4*F0, F5*F1, F6*F2, and F7*F3.
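The lane-by-lane behavior described above can be sketched in ordinary Python. This is only an illustrative model, not real MIPS or any actual SIMD extension: the function names daxpy_scalar and daxpy_4d, the list-based "register groups", and the byte-offset variables are all assumptions made for the example. Each step in the loop is labeled with the 4D instruction it stands in for.

```python
LANES = 4        # four 64-bit lanes in one 256-bit register group
ELEM_BYTES = 8   # size of one double-precision operand


def daxpy_scalar(a, x, y):
    """Reference DAXPY: Y[i] = a * X[i] + Y[i], one element at a time."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def daxpy_4d(a, x, y):
    """Hypothetical model of the 4D loop: four doubles per iteration."""
    assert len(x) == len(y) and len(x) % LANES == 0
    y = list(y)                      # work on a copy of Y
    f0_f3 = [a] * LANES              # L.D plus three MOVs: one copy of a per lane
    rx = ry = 0                      # byte offsets standing in for Rx and Ry
    end = len(x) * ELEM_BYTES        # role of R4 (512 when there are 64 elements)
    while rx != end:                 # DSUBU / BNEZ: loop until Rx reaches the bound
        i = rx // ELEM_BYTES
        j = ry // ELEM_BYTES
        f4_f7 = x[i:i + LANES]                                # L.4D: load X[i..i+3]
        f4_f7 = [xv * av for xv, av in zip(f4_f7, f0_f3)]     # MUL.4D: F4*F0 ... F7*F3
        f8_f11 = y[j:j + LANES]                               # L.4D: load Y[i..i+3]
        f8_f11 = [yv + pv for yv, pv in zip(f8_f11, f4_f7)]   # ADD.4D
        y[j:j + LANES] = f8_f11                               # S.4D: store into Y[i..i+3]
        rx += LANES * ELEM_BYTES                              # DADDIU Rx,Rx,#32
        ry += LANES * ELEM_BYTES                              # DADDIU Ry,Ry,#32
    return y


print(daxpy_4d(2.0, [1.0, 2.0, 3.0, 4.0], [10.0] * 4))  # [12.0, 14.0, 16.0, 18.0]
```

The model requires the element count to be a multiple of four, mirroring the assembly loop, which has no cleanup code for leftover elements.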

While not as dramatic as the 100× reduction of dynamic instruction bandwidth of VMIPS, SIMD MIPS does get a nearly 4× reduction: 149 versus 578 instructions executed for MIPS.
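The two instruction counts can be checked with a little arithmetic. The sketch below assumes 64 double-precision elements (the #512 byte bound divided by 8 bytes per element), five setup instructions and nine loop-body instructions for the SIMD code as listed above, and, as an assumption consistent with the 578 figure, two setup instructions and a nine-instruction body for the scalar MIPS loop, which is not reproduced in this section. The helper name dynamic_instructions is hypothetical.

```python
def dynamic_instructions(n_elements, elems_per_iter, setup, body):
    """Setup instructions plus loop-body instructions over all iterations."""
    iterations = n_elements // elems_per_iter
    return setup + iterations * body


N = 512 // 8  # 64 doubles covered by the #512 byte bound

# SIMD: L.D, three MOVs, DADDIU (5 setup); nine instructions per iteration.
simd = dynamic_instructions(N, elems_per_iter=4, setup=5, body=9)

# Scalar: assumed L.D and DADDIU (2 setup); assumed nine instructions per iteration.
scalar = dynamic_instructions(N, elems_per_iter=1, setup=2, body=9)

print(simd, scalar, round(scalar / simd, 2))  # 149 578 3.88
```

The ratio falls slightly short of 4 because the SIMD version spends three extra setup instructions copying the scalar a into each lane.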

Programming Multimedia SIMD Architectures

Given the ad hoc nature of the SIMD multimedia extensions, the easiest way to use these instructions has been through libraries or by writing in assembly language.

Recent extensions have become more regular, giving the compiler a more reasonable target.

By borrowing techniques from vectorizing compilers, compilers are starting to produce SIMD
