memory is much smaller in GPUs. While GPUs support memory protection at the page level,
they do not support demand paging.
In addition to the large numerical differences in processors, SIMD lanes, hardware thread
support, and cache sizes, there are many architectural differences. The scalar processor and
Multimedia SIMD instructions are tightly integrated in traditional computers; they are separated by an I/O bus in GPUs, and they even have separate main memories. The multiple SIMD
processors in a GPU use a single address space, but the caches are not coherent as they are in
traditional multicore computers. Unlike GPUs, multimedia SIMD instructions do not support
gather-scatter memory accesses, which Section 4.7 shows is a significant omission.
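To make that contrast concrete, the sketch below shows a gather written as a CUDA kernel: each CUDA Thread performs an indexed load, which the SIMD Processor carries out lane by lane. The kernel and array names (gather, y, x, index, n) are illustrative assumptions, not taken from this section.

__global__ void gather(float *y, const float *x, const int *index, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per CUDA Thread
    if (i < n)
        y[i] = x[index[i]];   // gather: each SIMD Lane loads from its own indexed address
}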
Summary
Now that the veil has been lifted, we can see that GPUs are really just multithreaded SIMD
processors, although they have more processors, more lanes per processor, and more multithreading hardware than do traditional multicore computers. For example, the Fermi GTX
480 has 15 SIMD processors with 16 lanes per processor and hardware support for 32 SIMD
threads. Fermi even embraces instruction-level parallelism by issuing instructions from two
SIMD threads to two sets of SIMD lanes. They also have less cache memory—Fermi's L2 cache
is 0.75 megabyte—and it is not coherent with the distant scalar processor.
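Multiplying those figures out gives a feel for the scale: 15 SIMD processors × 16 lanes is 240 SIMD Lanes in all, and with 32 SIMD Threads of 32 CUDA Threads each per processor, up to 15 × 32 × 32 = 15,360 CUDA Threads can be resident in hardware at once.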
The CUDA programming model wraps up all these forms of parallelism around a single abstraction, the CUDA Thread. Thus, the CUDA programmer can think of programming thousands of threads, although they are really executing each block of 32 threads on the many lanes of the many SIMD Processors. The CUDA programmer who wants good performance keeps in mind that these threads are grouped into blocks and executed 32 at a time, and that the addresses they access need to be adjacent for the memory system to perform well.
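As a rough illustration of both rules of thumb, the sketch below uses a block size that is a multiple of 32 and has each group of 32 threads touch adjacent addresses; the kernel name scale and the launch configuration are assumptions for the example, not details from this section.

__global__ void scale(float *a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        a[i] = s * a[i];       // threads executed together touch adjacent words
    // indexing a[i * 32] instead would spread each group of 32 across 32 cache lines
}

// Launched with 256 threads per block (a multiple of 32) and enough blocks to cover n:
// scale<<<(n + 255) / 256, 256>>>(d_a, 2.0f, n);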
Although we've used CUDA and the NVIDIA GPU in this section, rest assured that the same ideas are found in the OpenCL programming language and in GPUs from other companies.
Now that you understand better how GPUs work, we reveal the real jargon. Figures 4.24 and 4.25 match the descriptive terms and definitions of this section with the official CUDA/NVIDIA and AMD terms and definitions. We also include the OpenCL terms. We believe the GPU learning curve is steep in part because of using terms such as “Streaming Multiprocessor” for the SIMD Processor, “Thread Processor” for the SIMD Lane, and “Shared Memory” for Local Memory—especially since Local Memory is not shared between SIMD Processors! We hope that this two-step approach gets you up that curve quicker, even if it's a bit indirect.
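The naming pitfall around “Shared Memory” is easy to see in code: a __shared__ array is visible only to the thread block running on one multithreaded SIMD Processor, never to blocks on other SIMD Processors. The sketch below assumes an illustrative kernel tileSum launched with a block size of 256; neither detail comes from this section.

#define BLOCK 256

__global__ void tileSum(const float *x, float *blockTotals)
{
    // assumes blockDim.x == BLOCK and x holds gridDim.x * BLOCK elements
    __shared__ float tile[BLOCK];              // Local Memory of one SIMD Processor
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = x[i];
    __syncthreads();                           // synchronizes this block only

    if (threadIdx.x == 0) {                    // one thread sums the block's tile
        float sum = 0.0f;
        for (int j = 0; j < blockDim.x; j++)
            sum += tile[j];
        blockTotals[blockIdx.x] = sum;         // other blocks never see 'tile'
    }
}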