blocking to change the computation so that values could be kept in the vector registers. This approach lowered the number of memory references per FLOP and improved the performance by nearly a factor of two! Thus, the memory bandwidth on the Cray-1 became sufficient for a loop that formerly required more bandwidth.
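To make the blocking idea concrete, here is a minimal sketch of the transformation in C-style code: the loop nest is tiled so that the partial sum for each element of C stays in a register across the inner loop instead of being reloaded from memory on every pass. The function name, matrix layout, and tile size BK are illustrative, not the original Cray-1 code.

    /* Illustrative sketch of loop blocking (tiling). BK is a hypothetical
       tile width, tuned so a tile's worth of operands stays resident in
       the vector registers. C is assumed to be zero-initialized. */
    #define BK 64

    void matmul_blocked(int n, const float *A, const float *B, float *C) {
        for (int jj = 0; jj < n; jj += BK)
            for (int kk = 0; kk < n; kk += BK)
                for (int i = 0; i < n; i++)
                    for (int j = jj; j < jj + BK && j < n; j++) {
                        float sum = C[i * n + j];   /* kept in a register */
                        for (int k = kk; k < kk + BK && k < n; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;         /* one store per tile pass */
                    }
    }

Without the jj/kk tiling, each element of B would be streamed from memory on every iteration of the outer loops; with it, most operands are reused while they are still in registers, cutting memory references per FLOP.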
Fallacy: On GPUs, just add more threads if you don't have enough memory performance.
GPUs use many CUDA threads to hide the latency to main memory. If memory accesses are scattered or not correlated among CUDA threads, the memory system will get progressively slower in responding to each individual request. Eventually, even many threads will not cover the latency. For the "more CUDA threads" strategy to work, not only do you need lots of CUDA threads, but the CUDA threads themselves must also be well behaved in terms of locality of memory accesses.
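The hypothetical CUDA kernels below contrast the two cases. In coalesced_read, adjacent threads in a warp touch adjacent words, so their requests merge into a few memory transactions and extra threads can indeed hide the latency. In strided_read, each thread's request lands in a different memory segment, so adding threads only multiplies the number of transactions. Kernel names and the stride scheme are illustrative.

    // Well-behaved access: thread k reads word k, so a warp's 32 requests
    // coalesce into a handful of memory transactions.
    __global__ void coalesced_read(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Poorly behaved access: neighboring threads read words far apart,
    // so each request becomes its own transaction and more threads just
    // generate more traffic. (The % n keeps indices in bounds.)
    __global__ void strided_read(const float *in, float *out,
                                 int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(long long)i * stride % n];
    }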
4.9 Concluding Remarks
Data-level parallelism is increasing in importance for personal mobile devices, given the popularity of audio, video, and game applications on these devices. When combined with a model that is easier to program than task-level parallelism and with potentially better energy efficiency, it's easy to predict a renaissance for data-level parallelism in the next decade. Indeed, we can already see this emphasis in products, as both GPUs and traditional processors have been increasing the number of SIMD lanes at least as fast as they have been adding processors (see Figure 4.1 on page 263).
Hence, we are seeing system processors take on more of the characteristics of GPUs, and vice versa. One of the biggest differences in performance between conventional processors and GPUs has been gather-scatter addressing. Traditional vector architectures show how to add such addressing to SIMD instructions, and we expect to see more ideas from the well-proven vector architectures added to SIMD extensions over time.
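As a sketch of what gather-scatter addressing looks like from the programmer's side, the hypothetical CUDA kernel below reads through an index vector (a gather) and writes back through the same indices (a scatter). Vector architectures express the equivalent as single indexed load and store instructions driven by an index register; on a GPU, each CUDA thread simply computes its own address. All names and the doubling operation are illustrative.

    // Gather-scatter sketch: idx holds n indices, each assumed to be in
    // [0, n) and distinct, so the scatter writes are well defined.
    __global__ void gather_scatter(const float *a, float *b,
                                   const int *idx, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = a[idx[i]];    // gather: indexed load
            b[idx[i]] = v * 2.0f;   // scatter: indexed store
        }
    }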
As we said at the opening of Section 4.4, the GPU question is not simply which architecture
is best, but, given the hardware investment to do graphics well, how can it be enhanced to sup-
port computation that is more general? Although vector architectures have many advantages
on paper, it remains to be proven whether vector architectures can be as good a foundation for
graphics as GPUs.
GPU SIMD processors and compilers are still of relatively simple design. Techniques that
are more aggressive will likely be introduced over time to increase GPU utilization, especially
since GPU computing applications are just starting to be developed. By studying these new
programs, GPU designers will surely discover and implement new machine optimizations.
One question is whether the scalar processor (or control processor), which serves to save hard-
ware and energy in vector processors, will appear within GPUs.
The Fermi architecture has already included many features found in conventional pro-
cessors to make GPUs more mainstream, but there are still others necessary to close the gap.
Here are a few we expect to be addressed in the near future.
Virtualizable GPUs. Virtualization has proved important for servers and is the foundation of cloud computing (see Chapter 6). For GPUs to be included in the cloud, they will need to be just as virtualizable as the processors and memory that they are attached to.