Not only is this code shorter than the original routine, it is also much more pipeline-friendly,
as it contains no branches. Most branches in time-critical code can be removed
through various approaches, such as the one presented here.
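As a sketch of the general idea (not the specific routine referred to above), a conditional assignment can be rewritten as straight-line code by turning the comparison result into an all-zeros or all-ones mask. The function name here is purely illustrative:

#include <cstdint>

// Branch-free select: returns (a < b) ? x : y for 32-bit integers.
// Negating the 0-or-1 comparison result yields a mask of all zeros
// or all ones, which then picks one of the two inputs.
int32_t SelectIfLess(int32_t a, int32_t b, int32_t x, int32_t y)
{
    int32_t mask = -(int32_t)(a < b); // 0xFFFFFFFF if a < b, else 0
    return (x & mask) | (y & ~mask);
}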
13.9 Summary
A majority of the chapters in this book have discussed the efficient implementation
of data structures and algorithms as they apply to collision detection (as well as many
other fields). However, beyond data structures and algorithms, large performance
increases can be had by directly targeting and tuning for the hardware architecture
of the platform for which the collision system is intended. Another way of putting it
is that constant factors matter! This chapter has focused on two major performance
considerations for modern architectures: memory optimization and optimizations
utilizing parallelism.
Because CPUs can process data (and code) much faster than data can be
delivered to them from main memory, small but fast cache memory units sit
between the CPU and main memory to help deliver frequently used data to the CPU.
Properly utilizing these caches is important because there can be one to three orders
of magnitude difference in speed between data being in cache and not being in cache.
Several ways of optimizing for both code and data caches were discussed, from
rearranging code and data to using cache-aware data structures and algorithms.
It was also discussed how the presence of aliasing in C and C++ code is one of the
major inhibitors of compiler optimization, leaving unnecessary memory accesses
in the compiled code. What can be done to reduce the presence of aliasing was also
explored.
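One common way to reduce aliasing is to promise the compiler that pointers do not overlap. A minimal sketch, assuming the widely supported __restrict extension (the usual C++ spelling of C99's restrict qualifier); the function names are illustrative:

// Without restrict, the compiler must assume out may alias a or b,
// forcing it to reload a[i] and b[i] after every store to out[i].
void AddArrays(float *out, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

// With the no-aliasing promise, the compiler can keep values in
// registers and vectorize the loop more aggressively.
void AddArraysRestrict(float * __restrict out,
                       const float * __restrict a,
                       const float * __restrict b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}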
The discussion of optimizations utilizing parallelism focused on SIMD extensions.
Most CPUs today feature SIMD extensions, whereby instructions can operate
on 4, 8, or even 16 data items in parallel. Leveraging these instructions for, say,
primitive-primitive intersection tests can result in close to 4-, 8-, or 16-fold increases
in performance for such tests.
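As a rough sketch of the idea (assuming an SSE-capable CPU and spheres stored in structure-of-arrays form; the function name is illustrative), one sphere can be tested against four others in a single pass:

#include <xmmintrin.h>  // SSE intrinsics

// Test one sphere (cx,cy,cz,r) against four spheres given in
// structure-of-arrays form. Returns a 4-bit mask with bit i set
// if sphere i intersects the test sphere.
int SphereVsFourSpheres(float cx, float cy, float cz, float r,
                        const float px[4], const float py[4],
                        const float pz[4], const float pr[4])
{
    // Broadcast the single sphere's center into all four SIMD lanes
    __m128 x = _mm_set1_ps(cx), y = _mm_set1_ps(cy), z = _mm_set1_ps(cz);
    // Component-wise differences to the four sphere centers
    __m128 dx = _mm_sub_ps(_mm_loadu_ps(px), x);
    __m128 dy = _mm_sub_ps(_mm_loadu_ps(py), y);
    __m128 dz = _mm_sub_ps(_mm_loadu_ps(pz), z);
    // Squared center distances, four at a time
    __m128 d2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(dx, dx),
                                      _mm_mul_ps(dy, dy)),
                           _mm_mul_ps(dz, dz));
    // Squared sums of radii
    __m128 rs = _mm_add_ps(_mm_loadu_ps(pr), _mm_set1_ps(r));
    __m128 r2 = _mm_mul_ps(rs, rs);
    // Spheres intersect when distance^2 <= (r1 + r2)^2
    return _mm_movemask_ps(_mm_cmple_ps(d2, r2));
}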
The third and last performance consideration discussed here was branching. With
some CPU pipelines being over 20 stages long, mispredicted branches become very
expensive. Performance-critical code should avoid branches in inner loops, which
can be accomplished through various bit tricks.
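For instance, one classic bit trick computes the minimum of two integers without a branch. A sketch, assuming sign-extending right shift of negative values (which holds on common two's-complement platforms) and that a - b does not overflow:

#include <cstdint>

// Branchless minimum of two 32-bit integers.
// (a - b) >> 31 is all ones when a < b, all zeros otherwise,
// so the masked difference is added back only when a is smaller.
int32_t BranchlessMin(int32_t a, int32_t b)
{
    int32_t diff = a - b;
    return b + (diff & (diff >> 31));
}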