Example
To give an idea of what multimedia instructions look like, assume we added
256-bit SIMD multimedia instructions to MIPS. We concentrate on floating-point
in this example. We add the suffix “4D” on instructions that operate on
four double-precision operands at once. Like vector architectures, you can think
of a SIMD processor as having lanes, four in this case. MIPS SIMD will reuse
the floating-point registers as operands for 4D instructions, just as
double-precision reused single-precision registers in the original MIPS. This
example shows MIPS SIMD code for the DAXPY loop. Assume that the starting
addresses of X and Y are in Rx and Ry, respectively. Underline the changes to
the MIPS code for SIMD.
Answer
Here is the MIPS code:
      L.D     F0,a        ;load scalar a
      MOV     F1,F0       ;copy a into F1 for SIMD MUL
      MOV     F2,F0       ;copy a into F2 for SIMD MUL
      MOV     F3,F0       ;copy a into F3 for SIMD MUL
      DADDIU  R4,Rx,#512  ;last address to load
Loop: L.4D    F4,0(Rx)    ;load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D  F4,F4,F0    ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
      L.4D    F8,0(Ry)    ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D  F8,F8,F4    ;a×X[i]+Y[i], …, a×X[i+3]+Y[i+3]
      S.4D    F8,0(Ry)    ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU  Rx,Rx,#32   ;increment index to X
      DADDIU  Ry,Ry,#32   ;increment index to Y
      DSUBU   R20,R4,Rx   ;compute bound
      BNEZ    R20,Loop    ;check if done
The changes were replacing every MIPS double-precision instruction with
its 4D equivalent, increasing the increment from 8 to 32, and changing the
registers from F2 and F4 to F4 and F8 to get enough space in the register file for four
sequential double-precision operands. So that each SIMD lane would have its
own copy of the scalar a, we copied the value of F0 into registers F1, F2, and F3.
(Real SIMD instruction extensions have an instruction to broadcast a value to all
other registers in a group.) Thus, the multiply does F4*F0, F5*F1, F6*F2, and F7*F3.
While not as dramatic as the 100× reduction of dynamic instruction bandwidth
of VMIPS, SIMD MIPS does get almost a 4× reduction: 149 versus 578 instructions
executed for MIPS.
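
The lane pairing and the instruction counts above can be checked with a small emulation. The following is a minimal Python sketch, not part of the original example. The function names, the 32-entry register list, and the idea of counting one instruction per emulated line are all illustrative assumptions. It treats each 4D instruction as acting on four consecutive floating-point registers, stores the result to Y as the comment in the listing states, and assumes 64 elements (512 bytes at 8 bytes per double). The scalar count assumes a loop with two setup instructions and the same nine-instruction body, which is an assumption that happens to reproduce the 578 figure quoted above.

```python
# Hypothetical emulator for the SIMD MIPS DAXPY listing (illustrative only).
N = 64       # 512 bytes / 8 bytes per double, from DADDIU R4,Rx,#512
LANES = 4    # four double-precision lanes per 256-bit operation


def simd_daxpy(a, x, y):
    """Emulate the 4D loop; return (new_y, dynamic_instruction_count)."""
    F = [0.0] * 32          # floating-point registers F0..F31
    y = list(y)             # work on a copy of Y
    count = 0

    F[0] = a                # L.D F0,a
    count += 1
    for r in (1, 2, 3):     # MOV F1..F3,F0: one copy of a per lane
        F[r] = F[0]
        count += 1
    count += 1              # DADDIU R4,Rx,#512

    i = 0
    while True:
        F[4:8] = x[i:i + LANES]                                # L.4D   F4,0(Rx)
        F[4:8] = [F[4 + k] * F[k] for k in range(LANES)]       # MUL.4D F4,F4,F0
        F[8:12] = y[i:i + LANES]                               # L.4D   F8,0(Ry)
        F[8:12] = [F[8 + k] + F[4 + k] for k in range(LANES)]  # ADD.4D F8,F8,F4
        y[i:i + LANES] = F[8:12]                               # S.4D   F8,0(Ry)
        i += LANES          # both DADDIU steps: 32 bytes = 4 doubles
        count += 9          # 5 SIMD ops + 2 DADDIU + DSUBU + BNEZ
        if i == len(x):     # DSUBU/BNEZ: leave the loop when Rx reaches R4
            break
    return y, count


def scalar_instruction_count(n):
    """Assumed scalar loop: 2 setup instructions plus 9 per element."""
    return 2 + 9 * n


if __name__ == "__main__":
    x = [float(i) for i in range(N)]
    y = [1.0] * N
    result, simd_count = simd_daxpy(2.0, x, y)
    print(simd_count, scalar_instruction_count(N))
```

Under these assumptions the SIMD loop runs 16 iterations, giving 5 + 9 × 16 = 149 instructions against 2 + 9 × 64 = 578 for the scalar loop. The ratio is slightly below 4 because the three MOV instructions add setup work.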
Programming Multimedia SIMD Architectures
Given the ad hoc nature of the SIMD multimedia extensions, the easiest way to use these
instructions has been through libraries or by writing in assembly language.
Recent extensions have become more regular, giving the compiler a more reasonable target.
By borrowing techniques from vectorizing compilers, compilers are starting to produce SIMD