and ray-triangle are not given here, but outlines can be found in [Knittel00] and
[Bonnedal02], respectively.
Note that data-level parallelism can be utilized even on machines lacking SIMD
extensions. The PopCount8() function in Section 13.4.1 is an example of how to
implement data parallelism within a normal register variable.
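As an illustration of data parallelism inside an ordinary register, the following is a minimal sketch of a bit count that sums neighboring bit fields in parallel. It is not the PopCount8() listing from Section 13.4.1; the name popcount8_swar is hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Hypothetical helper (not the PopCount8() of Section 13.4.1).
   Counts the set bits of an 8-bit value by adding adjacent bit
   fields side by side within a single integer register. */
unsigned popcount8_swar(uint8_t v)
{
    unsigned n = v;
    n = (n & 0x55u) + ((n >> 1) & 0x55u);  /* four 2-bit partial sums */
    n = (n & 0x33u) + ((n >> 2) & 0x33u);  /* two 4-bit partial sums  */
    n = (n & 0x0Fu) + ((n >> 4) & 0x0Fu);  /* one 8-bit total         */
    return n;
}
```

Each line performs several small additions with one ordinary ADD, because the masks keep the partial sums from carrying into their neighbors.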
13.7.1 Four Spheres Versus Four Spheres SIMD Test
Given eight spheres $S_i$, $1 \le i \le 8$, each defined by a 4-tuple $(x_i, y_i, z_i, r_i)$, where $(x_i, y_i, z_i)$ is the sphere center and $r_i$ its radius, the four sphere-sphere tests

$$(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2 \le (r_i + r_j)^2$$

with $1 \le i \le 4$, $j = i + 4$, can be performed in parallel using SIMD instructions as follows. First let these eight SIMD registers be defined, where the notation is such that the register symbolically called PX contains the four values $x_1$, $x_2$, $x_3$, and $x_4$, and similarly for the remaining seven registers:
PX=x1|x2|x3|x4
QX=x5|x6|x7|x8
PY=y1|y2|y3|y4
QY=y5|y6|y7|y8
PZ=z1|z2|z3|z4
QZ=z5|z6|z7|z8
PR=r1|r2|r3|r4
QR=r5|r6|r7|r8
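The register layout above amounts to transposing the sphere data from an array of 4-tuples into one register per component. A minimal sketch in portable C follows, assuming a plain four-float struct as a stand-in for a SIMD register; the names Sphere, Vec4, SphereRegs, and pack_spheres are hypothetical and not from the text.

```c
/* Hypothetical types for illustration; not taken from the text. */
typedef struct { float x, y, z, r; } Sphere;  /* one 4-tuple (x, y, z, r)        */
typedef struct { float lane[4]; } Vec4;       /* stand-in for a 4-wide register  */

typedef struct {
    Vec4 PX, PY, PZ, PR;  /* components of spheres 1..4 */
    Vec4 QX, QY, QZ, QR;  /* components of spheres 5..8 */
} SphereRegs;

/* Transpose eight spheres (array of structures) into the eight
   registers (structure of arrays). s[0] is sphere 1, s[7] is sphere 8,
   so lane i of a P register pairs with lane i of the Q register. */
void pack_spheres(const Sphere s[8], SphereRegs *out)
{
    for (int i = 0; i < 4; i++) {
        out->PX.lane[i] = s[i].x;
        out->PY.lane[i] = s[i].y;
        out->PZ.lane[i] = s[i].z;
        out->PR.lane[i] = s[i].r;
        out->QX.lane[i] = s[i + 4].x;
        out->QY.lane[i] = s[i + 4].y;
        out->QZ.lane[i] = s[i + 4].z;
        out->QR.lane[i] = s[i + 4].r;
    }
}
```

In practice the data would preferably be stored in this per-component layout from the start, so that no transposition is needed before the test.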
The four sphere-sphere tests can then be performed in parallel by the following
SIMD assembly pseudocode.
SUB T1,PX,QX      ; T1 = PX - QX
SUB T2,PY,QY      ; T2 = PY - QY
SUB T3,PZ,QZ      ; T3 = PZ - QZ (T1-3 is difference between sphere centers)
ADD T4,PR,QR      ; T4 = PR + QR (T4 is sum of radii)
MUL T1,T1,T1      ; T1 = T1 * T1
MUL T2,T2,T2      ; T2 = T2 * T2
MUL T3,T3,T3      ; T3 = T3 * T3 (T1-3 is squared distance between sphere centers)
MUL R2,T4,T4      ; R2 = T4 * T4 (R2 is square of sum of radii)
ADD T1,T1,T2      ; T1 = T1 + T2
SUB T2,R2,T3      ; T2 = R2 - T3
LEQ Result,T1,T2  ; Result = T1 <= T2
The resulting code is just 11 instructions. With an instruction throughput/latency of
1/4 cycles, the final result is obtained in 16/19 cycles. Because each pass resolves four
tests at once, the effective throughput of the code is just four cycles per sphere-sphere test.
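The 11-instruction sequence can be mirrored in portable C by emulating each SIMD instruction lane by lane. This is only a sketch under stated assumptions: Vec4, Mask4, the v_* helpers, and four_sphere_tests are hypothetical names, and a real implementation would map each helper to the corresponding instruction or intrinsic of the target SIMD extension.

```c
/* Hypothetical stand-ins for 4-wide SIMD registers; not from the text. */
typedef struct { float lane[4]; } Vec4;
typedef struct { int lane[4]; } Mask4;   /* 1 = test passed, 0 = failed */

static Vec4 v_sub(Vec4 a, Vec4 b)
{
    Vec4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] - b.lane[i];
    return r;
}

static Vec4 v_add(Vec4 a, Vec4 b)
{
    Vec4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

static Vec4 v_mul(Vec4 a, Vec4 b)
{
    Vec4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

static Mask4 v_leq(Vec4 a, Vec4 b)
{
    Mask4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = (a.lane[i] <= b.lane[i]) ? 1 : 0;
    return r;
}

/* Lane i of the result is 1 when sphere i overlaps or touches sphere i+4. */
Mask4 four_sphere_tests(Vec4 PX, Vec4 PY, Vec4 PZ, Vec4 PR,
                        Vec4 QX, Vec4 QY, Vec4 QZ, Vec4 QR)
{
    Vec4 T1 = v_sub(PX, QX);   /* SUB T1,PX,QX */
    Vec4 T2 = v_sub(PY, QY);   /* SUB T2,PY,QY */
    Vec4 T3 = v_sub(PZ, QZ);   /* SUB T3,PZ,QZ */
    Vec4 T4 = v_add(PR, QR);   /* ADD T4,PR,QR */
    T1 = v_mul(T1, T1);        /* MUL T1,T1,T1 */
    T2 = v_mul(T2, T2);        /* MUL T2,T2,T2 */
    T3 = v_mul(T3, T3);        /* MUL T3,T3,T3 */
    Vec4 R2 = v_mul(T4, T4);   /* MUL R2,T4,T4 */
    T1 = v_add(T1, T2);        /* ADD T1,T1,T2 */
    T2 = v_sub(R2, T3);        /* SUB T2,R2,T3 */
    return v_leq(T1, T2);      /* LEQ Result,T1,T2 */
}
```

Note that the last three steps compare dx^2 + dy^2 against R2 - dz^2 rather than summing all three squares first. The two forms are algebraically equivalent, and the rearrangement lets the final ADD and SUB run independently of each other before the compare.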
 