more than twice as fast as the estimate. Because ALU operation time cannot
be shortened, the only reason for superior performance is that the two-level con-
straint solver has faster memory access. This is possible only because of better
utilization of the cache hierarchy.
Cache utilization is low for the global constraint solver because a rigid body is
processed only once in a kernel and the assignment of a constraint to a SIMD is
random for each kernel. Therefore, it cannot reuse any cached data from previous
kernel executions. In contrast, localized constraint solving and in-SIMD batch
dispatch of the two-level solver enable it to use the cache efficiently. When a rigid
body has multiple constraints in the region, it accesses the same body multiple
times. Also, in-SIMD batch dispatch keeps a SIMD running until all the batches
are processed; therefore, it is guaranteed that all batches of a constraint group
are always solved by the same SIMD. This means that cached data remains in
a SIMD, whereas the global constraint solver can assign any constraint from the
entire simulation domain to a SIMD, which likely thrashes the data in the cache
for every batch solve.
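The cache-reuse argument above can be made concrete with a small toy model (the names and numbers here are hypothetical and for illustration only, not the actual solver): each constraint references two rigid bodies, and a body access is a cache hit only if the same worker already holds that body.

```python
# Toy model of cache reuse under the two dispatch schemes (illustrative
# only; not the book's actual solver code).

def solve_batches(batches, cache):
    """Solve constraint batches on one worker and count cache hits.

    Each constraint is a pair of rigid-body indices; `cache` is the set
    of bodies resident in this worker's (SIMD's) cache.
    """
    hits = 0
    for body_a, body_b in batches:
        for b in (body_a, body_b):
            if b in cache:
                hits += 1      # body reused from the worker's cache
            else:
                cache.add(b)   # cold miss: body fetched from memory
    return hits

# Four batches of one constraint group; bodies 0..3 recur across batches.
batches = [(0, 1), (1, 2), (2, 3), (0, 3)]

# Two-level solver: the whole group stays on one SIMD, so its cache persists
# across batches and repeated body references hit.
local_hits = solve_batches(batches, set())

# Global solver: every batch is a fresh kernel on an arbitrary SIMD, so no
# cached data survives between kernel executions.
global_hits = sum(solve_batches([c], set()) for c in batches)

print(local_hits, global_hits)  # → 4 0
```

Even in this tiny example, keeping the constraint group on one worker turns every repeated body reference into a cache hit, while per-batch reassignment yields none.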
Another advantage of in-SIMD dispatch is that it reduces GPU processing
overhead by reducing the number of kernel dispatches. For scene 1, the global
constraint solver executes 11 kernels (Table 4.1) while the two-level constraint
solver executes four kernels to solve the system once. The two-level constraint
solver has an advantage in dispatch overhead as well.
From these analyses, we found that higher ALU occupancy is not the most
important factor for achieving high performance of a constraint solver for a rigid
body simulation on the GPU. To improve performance further, we need to reduce
the memory traffic using optimizations such as memory compression and cache-
aware ordering of constraint data.
The constraint solver using local batching is a persistent thread style imple-
mentation, because a SIMD keeps processing until all constraints in a constraint
group are solved [Aila and Laine 09]. Aila and Laine chose this style to improve
the occupancy of the GPU, but we found that it has another positive impact: a
performance improvement because of better cache utilization.
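The persistent thread pattern can be sketched on the CPU as follows (a minimal toy with hypothetical names; real GPU code would use an atomic counter in device memory rather than a lock, but the shape of the loop is the same: workers keep fetching work until the shared queue is empty, instead of launching one kernel per work item).

```python
import threading

work = list(range(12))   # 12 constraint batches to solve
next_item = 0            # shared work counter
lock = threading.Lock()
done = []                # record of solved batches

def fetch():
    """Atomically claim the next batch, or None when all work is taken."""
    global next_item
    with lock:           # a GPU kernel would use an atomic increment here
        if next_item >= len(work):
            return None
        i = next_item
        next_item += 1
        return work[i]

def persistent_worker():
    # Persistent-thread main loop: the worker stays alive and keeps
    # pulling batches until the queue is drained, so one launch covers
    # all batches and per-launch overhead is paid only once.
    while (item := fetch()) is not None:
        done.append(item)   # stand-in for solving the batch

threads = [threading.Thread(target=persistent_worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(done))   # all 12 batches solved by only 4 worker launches
```

The design point is that the number of launches is fixed by the number of workers, not by the number of batches, which is exactly why the two-level solver needs fewer kernel dispatches than the global one.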
Dispatching small kernels, as the global constraint solver does, is simple to
implement and worked well on older GPU architectures that lacked a memory
hierarchy. However, today's GPUs have evolved and are equipped with a cache
hierarchy. Our study has shown that the old GPU programming style, in which
small kernels are dispatched frequently, cannot exploit current GPU architec-
tures. Thus, a persistent thread style implementation is preferable for today's
GPUs. An alternative solution would be to provide an API to choose a SIMD to
run a computation, so that the GPU can reuse cached data across different kernel
executions.
This solver has been integrated into the Bullet 3 physics simulation library and
it is used as a basis for the GPU rigid body simulation solver. Full source code
is available at [Coumans 13].