Caches for GPU Memory —While the GPU philosophy is to have enough threads to hide
DRAM latency, there are variables that are needed across threads, such as local variables
mentioned above. Fermi includes both an L1 Data Cache and L1 Instruction Cache for each
multithreaded SIMD Processor and a single 768 KB L2 cache shared by all multithreaded
SIMD Processors in the GPU. As mentioned above, in addition to reducing bandwidth
pressure on GPU Memory, caches can save energy by staying on-chip rather than going
off-chip to DRAM. The L1 cache actually cohabits the same SRAM as Local Memory. Fermi
has a mode bit that offers the choice of using 64 KB of SRAM as a 16 KB L1 cache with 48
KB of Local Memory or as a 48 KB L1 cache with 16 KB of Local Memory. Note that the
GTX 480 has an inverted memory hierarchy: The size of the aggregate register file is 2 MB,
the size of all the L1 data caches is between 0.25 and 0.75 MB (depending on whether they
are 16 KB or 48 KB), and the size of the L2 cache is 0.75 MB. It will be interesting to see the
impact of this inverted ratio on GPU applications.
64-Bit Addressing and a Unified Address Space for All GPU Memories —This innovation makes
it much easier to provide the pointers needed for C and C++.
Error Correcting Codes to detect and correct errors in memory and registers (see Chapter
2)—To make long-running applications dependable on thousands of servers, ECC is the
norm in the datacenter (see Chapter 6).
Faster Context Switching —Given the large state of a multithreaded SIMD Processor, Fermi
has hardware support to switch contexts much more quickly. Fermi can switch in less than
25 microseconds, about 10× faster than its predecessor.
Faster Atomic Instructions —First included in the Tesla architecture, Fermi improves the
performance of atomic instructions by 5 to 20×, to a few microseconds. A special hardware
unit associated with the L2 cache, not inside the multithreaded SIMD Processors, handles
atomic instructions.
Similarities and Differences between Vector Architectures and GPUs
As we have seen, there really are many similarities between vector architectures and GPUs.
Along with the quirky jargon of GPUs, these similarities have contributed to the confusion
in architecture circles about how novel GPUs really are. Now that you've seen what is under
the covers of vector computers and GPUs, you can appreciate both the similarities and the
differences. Since both architectures are designed to execute data-level parallel programs but
take different paths, this comparison goes into depth to gain a better understanding of what is
needed for DLP hardware. Figure 4.21 shows the vector term first and then the closest
equivalent in a GPU.