techniques all introduce some overhead during the execution of the compute shader. If it
is possible to perform the same calculations without synchronization, the algorithm will
run faster. It is often easier to design an algorithm with synchronization than without, but
unless the synchronization methods are used to increase efficiency, their use may be detrimental to performance.
Sharing Between Threads
The next point may seem contradictory to our previous comments, but in some cases, ex-
plicitly designing synchronization into an algorithm can result in improved performance.
When memory bandwidth or the computational load can be reduced by sharing data be-
tween threads, it is certainly possible to speed up an algorithm's execution time by syn-
chronizing data across multiple threads. The key is to determine when it is appropriate to
do so.
Share loaded memory. One of the primary uses for compute shader programs is in image processing algorithms. Because compute shaders can access image-like resources directly, they map very naturally onto image-processing workloads. This also happens to be one of the areas that can take advantage of the group shared memory (GSM) to improve performance.
Depending on the algorithm being implemented, image processing is typically bound by the memory accesses it performs. However, if multiple pixels within a thread group can use the same sampled values, then it is quite possible to have each thread load a small number of values into the GSM, issue a memory barrier with group synchronization, and from that point on let all of the threads read the shared data. This effectively reduces the number of device memory accesses that each thread needs to perform and moves the desired data into the GSM, which is in general faster to access.
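As a sketch of this load-then-share pattern, consider a horizontal box blur in which each thread caches one texel (plus an apron at the group edges) in the GSM before filtering. The resource names, group size, and filter radius below are illustrative assumptions, not taken from the text:

```hlsl
// Sketch: cache texels in group shared memory, synchronize, then filter.
// All names, the group size, and the radius are illustrative assumptions.
Texture2D<float4>   InputImage  : register( t0 );
RWTexture2D<float4> OutputImage : register( u0 );

cbuffer ImageParams : register( b0 )
{
    uint ImageWidth;    // width of InputImage in texels
};

#define GROUP_SIZE 64
#define RADIUS     3

// One cached texel per thread, plus an apron of RADIUS texels on each side.
groupshared float4 Cache[GROUP_SIZE + 2 * RADIUS];

[numthreads( GROUP_SIZE, 1, 1 )]
void CSMain( uint3 GroupThreadID    : SV_GroupThreadID,
             uint3 DispatchThreadID : SV_DispatchThreadID )
{
    // Each thread loads its own texel into the GSM.
    Cache[GroupThreadID.x + RADIUS] = InputImage[DispatchThreadID.xy];

    // Threads near the group edges also load the apron texels (clamped).
    if ( GroupThreadID.x < RADIUS )
    {
        uint x = (uint)max( (int)DispatchThreadID.x - RADIUS, 0 );
        Cache[GroupThreadID.x] = InputImage[uint2( x, DispatchThreadID.y )];
    }
    if ( GroupThreadID.x >= GROUP_SIZE - RADIUS )
    {
        uint x = min( DispatchThreadID.x + RADIUS, ImageWidth - 1 );
        Cache[GroupThreadID.x + 2 * RADIUS] = InputImage[uint2( x, DispatchThreadID.y )];
    }

    // Wait until every thread in the group has finished writing the cache.
    GroupMemoryBarrierWithGroupSync();

    // Filter entirely from the GSM -- no further device memory reads.
    float4 sum = float4( 0.0f, 0.0f, 0.0f, 0.0f );
    for ( int i = -RADIUS; i <= RADIUS; ++i )
    {
        sum += Cache[GroupThreadID.x + RADIUS + i];
    }

    OutputImage[DispatchThreadID.xy] = sum / ( 2.0f * RADIUS + 1.0f );
}
```

Without the cache, each thread would read 2 · RADIUS + 1 texels from device memory; with it, each thread performs at most two device reads, and all remaining accesses are served from the GSM.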
However, care must be taken with this approach as well. There is some latency involved in writing to the GSM and then reading from it. If the reduction in device memory bandwidth does not offset the cost of accessing the GSM, then this technique could actually hurt performance. The trade-off becomes even less clear when texture caches are taken into consideration. It is quite possible that the built-in texture caches are already performing sufficient data caching, making it difficult to predict which technique will be faster. This
can also vary by GPU manufacturer, complicating matters even further. A good suggestion is to write your algorithms in a way that lets you quickly test both variants, so that the higher-performing method can be chosen appropriately. An even better approach is to have your application test the current platform and dynamically decide which technique to use.
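One simple way to keep both variants testable is to guard the GSM path behind a preprocessor define, compile the kernel twice, and time each version on the target GPU (for example, with timestamp queries) before choosing one at runtime. The define name and surrounding identifiers below are illustrative assumptions:

```hlsl
// Sketch: the same filter tap, selected at compile time. Building the kernel
// once with -D USE_GSM=1 and once with -D USE_GSM=0, then timing each variant
// on the current GPU, lets the application pick the faster path dynamically.
// USE_GSM, Cache, and the surrounding names are illustrative assumptions.
#if USE_GSM
    // Cached path: read the neighbor from group shared memory, which was
    // populated and synchronized earlier in the shader.
    float4 sample = Cache[GroupThreadID.x + RADIUS + i];
#else
    // Direct path: read from the texture and rely on the built-in
    // texture caches instead of the GSM.
    float4 sample = InputImage[uint2( DispatchThreadID.x + i, DispatchThreadID.y )];
#endif
```

Because the rest of the kernel is unchanged, any measured difference between the two compilations isolates the cost or benefit of the GSM caching itself.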
Share long calculation results. Just as loaded device memory contents can be cached in the GSM, threads can also share calculated values there. Here, too, it is difficult to predict whether it is faster to share calculations or simply to perform them independently in each thread. Modern