Currently available SIMD extensions include SSE and SSE2 from Intel, 3DNow!
from AMD, and AltiVec from Motorola and IBM. Sony's PlayStation 2 takes parallelism
to a new level by offering both multimedia extensions on its MIPS CPU core (featuring
two ALUs and one FPU) and two separate vector co-processors running parallel to
the main CPU, both with parallel instruction streams. Overall, the PlayStation 2 is
capable of executing up to seven different instructions at the same time! New game
consoles like the PlayStation 3 and the Xbox 2 will offer even more parallelism. The
following will focus on the use of SIMD instructions, which can be used to optimize
sequential code in two different ways.
Instruction-level parallelism. Existing algorithms can be optimized by identifying
similar operations and computing these in parallel. For example, the multiplications
in a dot product could be computed in parallel using a single SIMD
multiplication. Because there are serial aspects to most computations, some
parts of a process cannot be optimized using instruction-level parallelism, such
as the summation of the partial products within the dot product.
Data-level parallelism. SIMD can also be used to operate on completely different
data in parallel. For example, a typical use could be to compute four dot
products in parallel. Data-level parallelism is the natural use of SIMD
instructions, as the name (single instruction, multiple data) implies.
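The two uses described above can be sketched with SSE intrinsics in C. This is a minimal illustration rather than a tuned implementation. It assumes an x86 target with SSE, and the function names dot4 and dot3x4 and the component-wise input layout are hypothetical choices made for this example.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Instruction-level use: one dot product of two 4-component vectors.
   The four multiplications are done by a single SIMD multiply; the
   summation of the partial products remains a serial step. */
float dot4(const float a[4], const float b[4])
{
    float p[4];
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(p, prod);
    return (p[0] + p[1]) + (p[2] + p[3]);
}

/* Data-level use: four independent 3D dot products at once.
   Inputs are stored component-wise (all x values together, and so on),
   so lane i of every register belongs to vector pair i. */
void dot3x4(const float ax[4], const float ay[4], const float az[4],
            const float bx[4], const float by[4], const float bz[4],
            float out[4])
{
    __m128 r = _mm_mul_ps(_mm_loadu_ps(ax), _mm_loadu_ps(bx));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(ay), _mm_loadu_ps(by)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(az), _mm_loadu_ps(bz)));
    _mm_storeu_ps(out, r);
}
```

In dot4 only the multiplications run in parallel, whereas dot3x4 needs no cross-lane summation at all, which is why the data-level arrangement tends to map more naturally onto SIMD hardware.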
An example of a successful application of SIMD optimization is the acceleration
of ray intersections in an interactive ray-tracing application, by testing four rays in
parallel against a triangle using SIMD instructions [Wald01]. To feed the test, four
rays (arranged in a 2 × 2 cluster) are simultaneously traversed through a k-d tree.
The traversal decision is also made in parallel using SIMD instructions. A subtree is
visited if at least one ray intersects its defining volume. Although this traversal
decision causes some overhead, little extra work is performed in practice because the
2 × 2 cluster of rays is highly coherent. However, this query clustering is not as
applicable to collision detection in general, in which a group of ray queries tends
to span a much wider field. For the SIMD triangle test, a speedup of 3.5 to 3.7 times
compared to the non-SIMD C code was reported. The clustered traversal of four rays
in parallel provided an additional speedup of about a factor of 2.
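The "visit if at least one ray intersects" decision can be sketched as a single SIMD comparison followed by a mask extraction. The code below is an illustrative assumption, not the traversal code of [Wald01]. It assumes an x86 target with SSE, and it assumes each ray's entry and exit distances for the node (the hypothetical arrays t_near and t_far) have already been computed.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Returns a 4-bit mask with bit i set if ray i overlaps the node,
   that is, if its entry distance does not exceed its exit distance.
   The node is visited when the mask is nonzero. */
int packet_hit_mask(const float t_near[4], const float t_far[4])
{
    /* Each lane becomes all ones where t_near <= t_far, else all zeros. */
    __m128 hit = _mm_cmple_ps(_mm_loadu_ps(t_near), _mm_loadu_ps(t_far));
    /* movemask packs the sign bit of each lane into bits 0..3. */
    return _mm_movemask_ps(hit);
}
```

Returning the mask rather than a plain yes/no lets a caller keep track of which rays in the cluster are still active, at the cost of carrying inactive rays through subtrees that only some of them reach.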
Compilers in general remain rather underdeveloped in their support of SIMD
instructions. The support largely consists of providing intrinsic functions (built-in
functions corresponding more or less one-to-one to the assembly instructions).
Effective use of SIMD instructions still largely relies on hand coding in assembly
language.
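As a small illustration of what such intrinsic functions look like, the following sketch adds two groups of four floats using the SSE intrinsics from xmmintrin.h. It assumes an x86 target with SSE, and the function name add4 is hypothetical. The exact instructions emitted depend on the compiler and its settings.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Adds two arrays of four floats lane by lane.
   _mm_add_ps typically maps to the ADDPS instruction, and the unaligned
   load/store intrinsics to MOVUPS; the compiler still handles register
   allocation and scheduling. */
void add4(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```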
To illustrate how effective the use of SIMD instructions can be in collision detection
tests, the next three sections outline possible implementations of four simultaneous
sphere-sphere, sphere-AABB, and AABB-AABB tests. It is assumed here that the SIMD
architecture can hold four (floating-point) values in the same register and operate
on them simultaneously in a single instruction. SIMD intersection tests for ray-box