Example
To give an idea of what multimedia instructions look like, assume we added 256-bit SIMD multimedia instructions to MIPS. We concentrate on floating-point in this example. We add the suffix “4D” on instructions that operate on four double-precision operands at once. As with vector architectures, you can think of a SIMD processor as having lanes, four in this case. MIPS SIMD will reuse the floating-point registers as operands for 4D instructions, just as double-precision reused single-precision registers in the original MIPS. This example shows MIPS SIMD code for the DAXPY loop. Assume that the starting addresses of X and Y are in Rx and Ry, respectively. Underline the changes to the MIPS code for SIMD.
Answer
Here is the MIPS code:
      L.D     F0,a          ;load scalar a
      MOV     F1,F0         ;copy a into F1 for SIMD MUL
      MOV     F2,F0         ;copy a into F2 for SIMD MUL
      MOV     F3,F0         ;copy a into F3 for SIMD MUL
      DADDIU  R4,Rx,#512    ;last address to load
Loop: L.4D    F4,0(Rx)      ;load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D  F4,F4,F0      ;a × X[i], a × X[i+1], a × X[i+2], a × X[i+3]
      L.4D    F8,0(Ry)      ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D  F8,F8,F4      ;a × X[i]+Y[i], …, a × X[i+3]+Y[i+3]
      S.4D    F8,0(Ry)      ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU  Rx,Rx,#32     ;increment index to X
      DADDIU  Ry,Ry,#32     ;increment index to Y
      DSUBU   R20,R4,Rx     ;compute bound
      BNEZ    R20,Loop      ;check if done
The changes were replacing every MIPS double-precision instruction with its 4D equivalent, increasing the increment from 8 to 32, and changing the registers from F2 and F4 to F4 and F8 to get enough space in the register file for four sequential double-precision operands. So that each SIMD lane would have its own copy of the scalar a, we copied the value of F0 into registers F1, F2, and F3. (Real SIMD instruction extensions have an instruction to broadcast a value to all other registers in a group.) Thus, the multiply does F4*F0, F5*F1, F6*F2, and F7*F3.
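The lane behavior described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the hardware semantics: the function names `daxpy_scalar` and `daxpy_simd4` are hypothetical, Python lists stand in for memory and register groups, and the vector length of 64 is inferred from the 512-byte bound and 8-byte doubles in the code above.

```python
LANES = 4  # four double-precision operands per 256-bit register group


def daxpy_scalar(a, x, y):
    """Reference version: one element per step, like a scalar loop."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def daxpy_simd4(a, x, y):
    """Model of the 4D loop: each iteration handles LANES consecutive elements."""
    assert len(x) == len(y) and len(x) % LANES == 0
    a_lanes = [a] * LANES  # models copying a from F0 into F1, F2, F3
    out = list(y)
    # Stepping by LANES elements models the 32-byte address increment.
    for i in range(0, len(x), LANES):
        xv = x[i:i + LANES]                                # L.4D   (load X)
        prod = [xl * al for xl, al in zip(xv, a_lanes)]    # MUL.4D (lane by lane)
        yv = out[i:i + LANES]                              # L.4D   (load Y)
        out[i:i + LANES] = [p + yl for p, yl in zip(prod, yv)]  # ADD.4D, then S.4D to Y
    return out


x = [float(i) for i in range(64)]
y = [1.0] * 64
result = daxpy_simd4(2.0, x, y)
```

Each pass through the Python loop corresponds to one pass through the assembly loop, so 64 elements take 16 iterations instead of 64, and the result matches the element-at-a-time reference.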
While not as dramatic as the 100× reduction of dynamic instruction bandwidth of VMIPS, SIMD MIPS does get an almost 4× reduction: 149 versus 578 instructions executed for MIPS.
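The two instruction counts can be reproduced with simple arithmetic. The sketch below assumes 64 elements (512 bytes at 8 bytes per double) and a 9-instruction loop body in both versions. The SIMD setup count of 5 comes from the code above. The scalar setup count of 2 is an assumption, since the scalar MIPS loop is not shown in this section, but it is consistent with the stated total of 578. The helper name `dynamic_count` is hypothetical.

```python
ELEMENTS = 64   # 512 bytes / 8 bytes per double-precision element
LOOP_BODY = 9   # instructions per loop iteration in both versions


def dynamic_count(setup, elements_per_iteration):
    """Instructions executed: setup once, then the loop body per iteration."""
    iterations = ELEMENTS // elements_per_iteration
    return setup + iterations * LOOP_BODY


# Scalar MIPS: assumed 2 setup instructions, one element per iteration.
mips_count = dynamic_count(setup=2, elements_per_iteration=1)

# SIMD MIPS: L.D, three MOVs, and DADDIU as setup, four elements per iteration.
simd_count = dynamic_count(setup=5, elements_per_iteration=4)

ratio = mips_count / simd_count
```

The ratio falls slightly short of 4 because the SIMD version spends three extra setup instructions copying the scalar into each lane.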
Programming Multimedia SIMD Architectures
Given the ad hoc nature of the SIMD multimedia extensions, the easiest way to use these instructions has been through libraries or by writing in assembly language.
Recent extensions have become more regular, giving the compiler a more reasonable target.
By borrowing techniques from vectorizing compilers, compilers are starting to produce SIMD