Not only is this code shorter than the original routine, it is also much more pipeline-friendly,
as it contains no branches. Most branches in time-critical code can be removed
through various approaches, such as the one presented here.
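As a sketch of the general idea (not the specific routine referred to above), a conditional assignment can be rewritten as straight-line code by turning the comparison result into an all-zeros or all-ones mask. The function name here is purely illustrative:

#include <cstdint>

// Branch-free select: returns (a < b) ? x : y for 32-bit integers.
// Negating the 0-or-1 comparison result yields a mask of all zeros
// or all ones, which then picks one of the two inputs.
int32_t SelectIfLess(int32_t a, int32_t b, int32_t x, int32_t y)
{
    int32_t mask = -(int32_t)(a < b); // 0xFFFFFFFF if a < b, else 0
    return (x & mask) | (y & ~mask);
}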
13.9 Summary
A majority of the chapters in this book have discussed the efficient implementation
of data structures and algorithms as they apply to collision detection (as well as many
other fields). However, beyond data structures and algorithms, large performance
increases can be had by directly targeting and tuning for the hardware architecture
of the platform for which the collision system is intended. Another way of putting it
is that constant factors matter! This chapter has focused on two major performance
considerations for modern architectures: memory optimization and optimizations
utilizing parallelism.
Because CPUs can process data (and code) much faster than data can be
delivered to them from main memory, small but fast cache memory units sit
between the CPU and main memory to help deliver frequently used data to the CPU.
Properly utilizing these caches is important because there can be one to three orders
of magnitude difference in speed between data being in cache and not being in cache.
Several ways of optimizing for both code and data caches were discussed, from
rearranging code and data to using cache-aware data structures and algorithms.
It was also discussed how the presence of aliasing in C and C++ code is one of the
major inhibitors of compiler optimization, leaving unnecessary memory accesses
in the compiled code. What can be done to reduce the presence of aliasing was also
explored.
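One common way to reduce aliasing is to promise the compiler that pointers do not overlap. A minimal sketch, assuming the widely supported __restrict extension (the usual C++ spelling of C99's restrict qualifier); the function names are illustrative:

// Without restrict, the compiler must assume out may alias a or b,
// forcing it to reload a[i] and b[i] after every store to out[i].
void AddArrays(float *out, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

// With the no-aliasing promise, the compiler can keep values in
// registers and vectorize the loop more aggressively.
void AddArraysRestrict(float * __restrict out,
                       const float * __restrict a,
                       const float * __restrict b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}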
The discussion of optimizations utilizing parallelism focused on SIMD extensions.
Most CPUs today feature SIMD extensions, whereby instructions can operate
on 4, 8, or even 16 data items in parallel. Leveraging these instructions for, say,
primitive-primitive intersection tests can result in close to 4-, 8-, or 16-fold increases
in performance for such tests.
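As a rough sketch of the idea (assuming an SSE-capable CPU and spheres stored in structure-of-arrays form; the function name is illustrative), one sphere can be tested against four others in a single pass:

#include <xmmintrin.h>  // SSE intrinsics

// Test one sphere (cx,cy,cz,r) against four spheres given in
// structure-of-arrays form. Returns a 4-bit mask with bit i set
// if sphere i intersects the test sphere.
int SphereVsFourSpheres(float cx, float cy, float cz, float r,
                        const float px[4], const float py[4],
                        const float pz[4], const float pr[4])
{
    // Broadcast the single sphere's center into all four SIMD lanes
    __m128 x = _mm_set1_ps(cx), y = _mm_set1_ps(cy), z = _mm_set1_ps(cz);
    // Component-wise differences to the four sphere centers
    __m128 dx = _mm_sub_ps(_mm_loadu_ps(px), x);
    __m128 dy = _mm_sub_ps(_mm_loadu_ps(py), y);
    __m128 dz = _mm_sub_ps(_mm_loadu_ps(pz), z);
    // Squared center distances, four at a time
    __m128 d2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(dx, dx),
                                      _mm_mul_ps(dy, dy)),
                           _mm_mul_ps(dz, dz));
    // Squared sums of radii
    __m128 rs = _mm_add_ps(_mm_loadu_ps(pr), _mm_set1_ps(r));
    __m128 r2 = _mm_mul_ps(rs, rs);
    // Spheres intersect when distance^2 <= (r1 + r2)^2
    return _mm_movemask_ps(_mm_cmple_ps(d2, r2));
}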
The third and last performance consideration discussed here was branching. With
some CPU pipelines being over 20 stages long, mispredicted branches become very
expensive. Performance-critical code should avoid branches in inner loops, which
can be accomplished through various bit tricks.
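For instance, one classic bit trick computes the minimum of two integers without a branch. A sketch, assuming sign-extending right shift of negative values (which holds on common two's-complement platforms) and that a - b does not overflow:

#include <cstdint>

// Branchless minimum of two 32-bit integers.
// (a - b) >> 31 is all ones when a < b, all zeros otherwise,
// so the masked difference is added back only when a is smaller.
int32_t BranchlessMin(int32_t a, int32_t b)
{
    int32_t diff = a - b;
    return b + (diff & (diff >> 31));
}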