more than twice as fast as the estimate. Because ALU operation time cannot
be shortened, the only reason for superior performance is that the two-level con-
straint solver has faster memory access. This is possible only because of better
utilization of the cache hierarchy.
Cache utilization is low for the global constraint solver because a rigid body is
processed only once in a kernel and the assignment of a constraint to a SIMD is
random for each kernel. Therefore, it cannot reuse any cached data from previous
kernel executions. In contrast, localized constraint solving and in-SIMD batch
dispatch of the two-level solver enable it to use the cache efficiently. When a rigid
body has multiple constraints in the region, it accesses the same body multiple
times. Also, in-SIMD batch dispatch keeps a SIMD running until all the batches
are processed; therefore, it is guaranteed that all batches of a constraint group
are always solved by the same SIMD. This means that cached data remains in
a SIMD, whereas the global constraint solver can assign any constraint from the
entire simulation domain to a SIMD, which likely thrashes the data in the cache
for every batch solve.
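The cache-reuse argument above can be made concrete with a small toy model (the names and numbers here are hypothetical and for illustration only, not the actual solver): each constraint references two rigid bodies, and a body access is a cache hit only if the same worker already holds that body.

```python
# Toy model of cache reuse under the two dispatch schemes (illustrative
# only; not the book's actual solver code).

def solve_batches(batches, cache):
    """Solve constraint batches on one worker and count cache hits.

    Each constraint is a pair of rigid-body indices; `cache` is the set
    of bodies resident in this worker's (SIMD's) cache.
    """
    hits = 0
    for body_a, body_b in batches:
        for b in (body_a, body_b):
            if b in cache:
                hits += 1      # body reused from the worker's cache
            else:
                cache.add(b)   # cold miss: body fetched from memory
    return hits

# Four batches of one constraint group; bodies 0..3 recur across batches.
batches = [(0, 1), (1, 2), (2, 3), (0, 3)]

# Two-level solver: the whole group stays on one SIMD, so its cache persists
# across batches and repeated body references hit.
local_hits = solve_batches(batches, set())

# Global solver: every batch is a fresh kernel on an arbitrary SIMD, so no
# cached data survives between kernel executions.
global_hits = sum(solve_batches([c], set()) for c in batches)

print(local_hits, global_hits)  # → 4 0
```

Even in this tiny example, keeping the constraint group on one worker turns every repeated body reference into a cache hit, while per-batch reassignment yields none.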
Another advantage of in-SIMD dispatch is that it reduces GPU processing
overhead by reducing the number of kernel dispatches. For scene 1, the global
constraint solver executes 11 kernels (Table 4.1) while the two-level constraint
solver executes four kernels to solve the system once. The two-level constraint
solver has an advantage in dispatch overhead as well.
From these analyses, we found that higher ALU occupancy is not the most
important factor for achieving high performance of a constraint solver for a rigid
body simulation on the GPU. To improve performance further, we need to reduce
the memory traffic using optimizations such as memory compression and cache-
aware ordering of constraint data.
The constraint solver using local batching is a persistent thread style imple-
mentation, because a SIMD keeps processing until all constraints in a constraint
group are solved [Aila and Laine 09]. Aila and Laine chose this style to improve
the occupancy of the GPU, but we found that it has another positive impact: a
performance improvement because of better cache utilization.
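The persistent thread pattern can be sketched on the CPU as follows (a minimal toy with hypothetical names; real GPU code would use an atomic counter in device memory rather than a lock, but the shape of the loop is the same: workers keep fetching work until the shared queue is empty, instead of launching one kernel per work item).

```python
import threading

work = list(range(12))   # 12 constraint batches to solve
next_item = 0            # shared work counter
lock = threading.Lock()
done = []                # record of solved batches

def fetch():
    """Atomically claim the next batch, or None when all work is taken."""
    global next_item
    with lock:           # a GPU kernel would use an atomic increment here
        if next_item >= len(work):
            return None
        i = next_item
        next_item += 1
        return work[i]

def persistent_worker():
    # Persistent-thread main loop: the worker stays alive and keeps
    # pulling batches until the queue is drained, so one launch covers
    # all batches and per-launch overhead is paid only once.
    while (item := fetch()) is not None:
        done.append(item)   # stand-in for solving the batch

threads = [threading.Thread(target=persistent_worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(done))   # all 12 batches solved by only 4 worker launches
```

The design point is that the number of launches is fixed by the number of workers, not by the number of batches, which is exactly why the two-level solver needs fewer kernel dispatches than the global one.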
Dispatching small kernels, as the global constraint solver does, is simple to
implement and worked well on older GPU architectures that lacked a memory
hierarchy. However, today's GPUs have evolved and are equipped with a cache
hierarchy. Our study has shown that the old GPU programming style, in which
small kernels are dispatched frequently, cannot exploit current GPU architec-
tures. Thus, a persistent thread style implementation is preferable for today's
GPUs. An alternative solution would be to provide an API to choose a SIMD to
run a computation, so that the GPU can reuse cached data across different kernel
executions.
This solver has been integrated into the Bullet 3 physics simulation library and
it is used as a basis for the GPU rigid body simulation solver. Full source code
is available at [Coumans 13].