Caches for GPU Memory —While the GPU philosophy is to have enough threads to hide
DRAM latency, there are variables that are needed across threads, such as local variables
mentioned above. Fermi includes both an L1 Data Cache and L1 Instruction Cache for each
multithreaded SIMD Processor and a single 768 KB L2 cache shared by all multithreaded
SIMD Processors in the GPU. As mentioned above, in addition to reducing bandwidth
pressure on GPU Memory, caches can save energy by staying on-chip rather than going
off-chip to DRAM. The L1 cache actually cohabits the same SRAM as Local Memory. Fermi
has a mode bit that offers the choice of using 64 KB of SRAM as a 16 KB L1 cache with 48
KB of Local Memory or as a 48 KB L1 cache with 16 KB of Local Memory. Note that the
GTX 480 has an inverted memory hierarchy: The size of the aggregate register file is 2 MB,
the size of all the L1 data caches is between 0.25 and 0.75 MB (depending on whether they
are 16 KB or 48 KB), and the size of the L2 cache is 0.75 MB. It will be interesting to see the
impact of this inverted ratio on GPU applications.
64-Bit Addressing and a Unified Address Space for All GPU Memories —This innovation makes
it much easier to provide the pointers needed for C and C++.
Error Correcting Codes to detect and correct errors in memory and registers (see Chapter
2)—To make long-running applications dependable on thousands of servers, ECC is the
norm in the datacenter (see Chapter 6).
Faster Context Switching —Given the large state of a multithreaded SIMD Processor, Fermi
has hardware support to switch contexts much more quickly. Fermi can switch in less than
25 microseconds, about 10× faster than its predecessor.
Faster Atomic Instructions —First included in the Tesla architecture, Fermi improves the
performance of atomic instructions by 5 to 20×, to a few microseconds. A special hardware
unit associated with the L2 cache, not inside the multithreaded SIMD Processors, handles
atomic instructions.
Similarities and Differences between Vector Architectures and GPUs
As we have seen, there really are many similarities between vector architectures and GPUs.
Along with the quirky jargon of GPUs, these similarities have contributed to the confusion
in architecture circles about how novel GPUs really are. Now that you've seen what is under
the covers of vector computers and GPUs, you can appreciate both the similarities and the
differences. Since both architectures are designed to execute data-level parallel programs but
take different paths, this comparison goes into depth to gain a better understanding of what is
needed for DLP hardware. Figure 4.21 shows the vector term first and then the closest
equivalent in a GPU.