techniques all introduce some overhead during the execution of the compute shader. If it
is possible to perform the same calculations without synchronization, the algorithm will
run faster. It is often easier to design an algorithm with synchronization than without, but
unless the synchronization methods are used to increase efficiency, their use may be detrimental to performance.
Sharing Between Threads
The next point may seem contradictory to our previous comments, but in some cases, ex-
plicitly designing synchronization into an algorithm can result in improved performance.
When memory bandwidth or the computational load can be reduced by sharing data be-
tween threads, it is certainly possible to speed up an algorithm's execution time by syn-
chronizing data across multiple threads. The key is to determine when it is appropriate to
do so.
Share loaded memory. One of the primary uses for compute shader programs is in image processing algorithms. Because compute shaders can access image-like resources directly, they map very naturally onto image-processing workloads. This also happens to be one of the areas that can take advantage of the group shared memory (GSM) to improve performance.
Depending on the algorithm being implemented, image processing is typically bound by the memory accesses it performs. However, if multiple pixels within a thread group can use the same sampled values, then it is quite possible to have each thread load a small number of values into the GSM, issue a memory barrier with group synchronization, and from that point on let all of the threads read the shared data. This effectively reduces the number of device memory accesses that each thread needs to perform and moves the desired data into the GSM, which is in general faster to access.
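As a sketch of this load-then-share pattern, consider a horizontal box blur in which each thread caches one texel (plus an apron at the group edges) in the GSM before filtering. The resource names, group size, and filter radius below are illustrative assumptions, not taken from the text:

```hlsl
// Sketch: cache texels in group shared memory, synchronize, then filter.
// All names, the group size, and the radius are illustrative assumptions.
Texture2D<float4>   InputImage  : register( t0 );
RWTexture2D<float4> OutputImage : register( u0 );

cbuffer ImageParams : register( b0 )
{
    uint ImageWidth;    // width of InputImage in texels
};

#define GROUP_SIZE 64
#define RADIUS     3

// One cached texel per thread, plus an apron of RADIUS texels on each side.
groupshared float4 Cache[GROUP_SIZE + 2 * RADIUS];

[numthreads( GROUP_SIZE, 1, 1 )]
void CSMain( uint3 GroupThreadID    : SV_GroupThreadID,
             uint3 DispatchThreadID : SV_DispatchThreadID )
{
    // Each thread loads its own texel into the GSM.
    Cache[GroupThreadID.x + RADIUS] = InputImage[DispatchThreadID.xy];

    // Threads near the group edges also load the apron texels (clamped).
    if ( GroupThreadID.x < RADIUS )
    {
        uint x = (uint)max( (int)DispatchThreadID.x - RADIUS, 0 );
        Cache[GroupThreadID.x] = InputImage[uint2( x, DispatchThreadID.y )];
    }
    if ( GroupThreadID.x >= GROUP_SIZE - RADIUS )
    {
        uint x = min( DispatchThreadID.x + RADIUS, ImageWidth - 1 );
        Cache[GroupThreadID.x + 2 * RADIUS] = InputImage[uint2( x, DispatchThreadID.y )];
    }

    // Wait until every thread in the group has finished writing the cache.
    GroupMemoryBarrierWithGroupSync();

    // Filter entirely from the GSM -- no further device memory reads.
    float4 sum = float4( 0.0f, 0.0f, 0.0f, 0.0f );
    for ( int i = -RADIUS; i <= RADIUS; ++i )
    {
        sum += Cache[GroupThreadID.x + RADIUS + i];
    }

    OutputImage[DispatchThreadID.xy] = sum / ( 2.0f * RADIUS + 1.0f );
}
```

Without the cache, each thread would read 2 · RADIUS + 1 texels from device memory; with it, each thread performs at most two device reads, and all remaining accesses are served from the GSM.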
However, care must be taken with this approach as well. There is some latency involved in writing to the GSM and then reading from it. If the reduction in device memory bandwidth does not offset the cost of accessing the GSM, then this technique could actually hurt performance. The trade-off becomes even less clear when texture caches are taken into consideration. It is quite possible that the built-in texture caches are already performing sufficient data caching, making it difficult to predict which technique will be faster. This
can also vary by GPU manufacturer, complicating matters even further. A good suggestion is to write your algorithms in a way that lets you quickly test both variants, so that the higher-performing method can be chosen appropriately. An even better approach is to have your application test the current platform and dynamically decide which technique to use.
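One simple way to keep both variants testable is to guard the GSM path behind a preprocessor define, compile the kernel twice, and time each version on the target GPU (for example, with timestamp queries) before choosing one at runtime. The define name and surrounding identifiers below are illustrative assumptions:

```hlsl
// Sketch: the same filter tap, selected at compile time. Building the kernel
// once with -D USE_GSM=1 and once with -D USE_GSM=0, then timing each variant
// on the current GPU, lets the application pick the faster path dynamically.
// USE_GSM, Cache, and the surrounding names are illustrative assumptions.
#if USE_GSM
    // Cached path: read the neighbor from group shared memory, which was
    // populated and synchronized earlier in the shader.
    float4 sample = Cache[GroupThreadID.x + RADIUS + i];
#else
    // Direct path: read from the texture and rely on the built-in
    // texture caches instead of the GSM.
    float4 sample = InputImage[uint2( DispatchThreadID.x + i, DispatchThreadID.y )];
#endif
```

Because the rest of the kernel is unchanged, any measured difference between the two compilations isolates the cost or benefit of the GSM caching itself.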
Share long calculation results. Just as loaded device memory contents can be cached in the GSM, threads can also share calculated values there. Here, too, it is difficult to predict whether it is faster to share calculations or simply to perform them independently in each thread. Modern