While hiding memory latency is the underlying philosophy, note that the latest GPUs and vector processors have added caches. For example, the recent Fermi architecture has added caches, but they are thought of either as bandwidth filters to reduce demands on GPU Memory or as accelerators for the few variables whose latency cannot be hidden by multithreading. Thus local memory for stack frames, function calls, and register spilling is a good match to caches, since latency matters when calling a function. Caches also save energy, since on-chip cache accesses take much less energy than accesses to multiple, external DRAM chips.
To improve memory bandwidth and reduce overhead, as mentioned above, PTX data transfer instructions coalesce individual parallel thread requests from the same SIMD thread together into a single memory block request when the addresses fall in the same block. These address restrictions are placed on the GPU program, somewhat analogous to the guidelines for system processor programs to engage hardware prefetching (see Chapter 2). The GPU memory controller will also hold requests and send ones to the same open DRAM page together to improve memory bandwidth (see Section 4.6). Chapter 2 describes DRAM in sufficient detail to understand the potential benefits of grouping related addresses.
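The effect of coalescing can be illustrated with a toy address model (not the actual PTX hardware logic): count how many distinct memory blocks the 32 threads of a SIMD thread touch, assuming a hypothetical 128-byte block size. Unit-stride accesses collapse into one block request, while large strides generate one request per thread.

```python
def coalesced_requests(addresses, block_size=128):
    """Count the memory block requests needed for one SIMD thread's
    accesses: addresses in the same aligned block merge into one request.
    The 128-byte block size is an assumption for this sketch."""
    return len({addr // block_size for addr in addresses})

# Unit-stride 4-byte accesses from 32 threads fall in one 128-byte block:
unit_stride = [tid * 4 for tid in range(32)]
print(coalesced_requests(unit_stride))   # → 1 block request

# A 128-byte stride puts every thread's access in its own block:
strided = [tid * 128 for tid in range(32)]
print(coalesced_requests(strided))       # → 32 block requests
```

The 32-to-1 gap between the two cases is why GPU programming guides stress unit-stride access patterns across the threads of a warp.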
Innovations in the Fermi GPU Architecture
The multithreaded SIMD Processor of Fermi is more complicated than the simplified version in Figure 4.14. To increase hardware utilization, each SIMD Processor has two SIMD Thread Schedulers and two instruction dispatch units. The dual SIMD Thread Scheduler selects two threads of SIMD instructions and issues one instruction from each to two sets of 16 SIMD Lanes, 16 load/store units, or 4 special function units. Thus, two threads of SIMD instructions are scheduled every two clock cycles to any of these collections. Since the threads are independent, there is no need to check for data dependences in the instruction stream. This innovation would be analogous to a multithreaded vector processor that can issue vector instructions from two independent threads.
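The dual-issue behavior described above can be modeled with a small simulation. The sketch below is only an illustration of the scheduling pattern, not the real hardware policy: it assumes (hypothetically) that each scheduler owns half the SIMD threads and round-robins over them, issuing one instruction per scheduler every two clock cycles.

```python
from collections import deque

def dual_issue_schedule(num_threads, num_instructions, cycles):
    """Toy model of a dual SIMD Thread Scheduler: every two clock
    cycles, each of two schedulers issues one instruction from a ready
    SIMD thread.  Threads are independent, so no dependence check is
    needed between the two issued instructions."""
    # Assumption for this sketch: scheduler 0 owns even-numbered SIMD
    # threads, scheduler 1 the odd-numbered ones.
    queues = [deque(t for t in range(num_threads) if t % 2 == s)
              for s in (0, 1)]
    remaining = {t: num_instructions for t in range(num_threads)}
    issue_log = []  # (cycle, scheduler, thread) triples
    for cycle in range(0, cycles, 2):   # one issue slot every two cycles
        for sched, q in enumerate(queues):
            if q:
                t = q.popleft()
                issue_log.append((cycle, sched, t))
                remaining[t] -= 1
                if remaining[t] > 0:
                    q.append(t)         # round-robin: back of the queue
    return issue_log

# 4 SIMD threads, 2 instructions each, over 8 clock cycles:
log = dual_issue_schedule(num_threads=4, num_instructions=2, cycles=8)
for entry in log:
    print(entry)  # two issues per even-numbered cycle, one per scheduler
```

Running the example shows both schedulers issuing in every two-cycle slot, which is the source of the utilization gain: the issue hardware is never idle waiting on a single thread of SIMD instructions.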
Figure 4.19 shows the Dual Scheduler issuing instructions and Figure 4.20 shows the block
diagram of the multithreaded SIMD Processor of a Fermi GPU.