Since the thread consists of SIMD instructions, the SIMD Processor must have parallel functional units to perform the operation. We call them SIMD Lanes, and they are quite similar to the Vector Lanes in Section 4.2.
The number of lanes per SIMD Processor varies across GPU generations. With Fermi, each 32-wide thread of SIMD instructions is mapped to 16 physical SIMD Lanes, so each SIMD instruction in a thread of SIMD instructions takes two clock cycles to complete. Each thread of SIMD instructions is executed in lock step and only scheduled at the beginning. Staying with the analogy of a SIMD Processor as a vector processor, you could say that it has 16 lanes, the vector length would be 32, and the chime is 2 clock cycles. (This wide but shallow nature is why we use the term SIMD Processor instead of vector processor, as it is more descriptive.)
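The lanes-to-chime arithmetic can be checked with a quick sketch. The numbers (32-wide threads, 16 physical lanes) come from the Fermi example above; the helper function name is ours:

```python
def chime(thread_width, num_lanes):
    """Clock cycles for one SIMD instruction: thread width spread across the lanes."""
    return thread_width // num_lanes

# Fermi: a 32-wide thread of SIMD instructions on 16 SIMD Lanes
print(chime(32, 16))  # 2 clock cycles per SIMD instruction
```

With more lanes per processor the chime shrinks; with narrower hardware it grows, but the programming model stays the same.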
Since by definition the threads of SIMD instructions are independent, the SIMD Thread Scheduler can pick whatever thread of SIMD instructions is ready, and need not stick with the next SIMD instruction in the sequence within a thread. The SIMD Thread Scheduler includes a scoreboard (see Chapter 3) to keep track of up to 48 threads of SIMD instructions to see which SIMD instruction is ready to go. This scoreboard is needed because memory access instructions can take an unpredictable number of clock cycles due, for example, to memory bank conflicts. Figure 4.16 shows the SIMD Thread Scheduler picking threads of SIMD instructions in a different order over time. The assumption of GPU architects is that GPU applications have so many threads of SIMD instructions that multithreading can both hide the latency to DRAM and increase the utilization of multithreaded SIMD Processors. However, to hedge their bets, the recent NVIDIA Fermi GPU includes an L2 cache (see Section 4.7).
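The scheduling idea can be sketched in a few lines. This is an illustration of the principle, not NVIDIA's actual hardware logic: each thread of SIMD instructions is either ready or stalled (for example, waiting on a memory access), and each cycle the scheduler issues from any ready thread rather than the next one in a fixed order. All names and the fixed 4-cycle stall are our own assumptions:

```python
from collections import deque

class SIMDThread:
    def __init__(self, tid, stall_until=0):
        self.tid = tid
        self.stall_until = stall_until  # cycle at which this thread's next instruction is ready

    def ready(self, cycle):
        return cycle >= self.stall_until

def schedule(threads, total_cycles):
    """Scoreboard-style scheduler: each cycle, issue from whichever thread is ready."""
    issued = []
    pool = deque(threads)
    for cycle in range(total_cycles):
        for _ in range(len(pool)):
            t = pool.popleft()
            pool.append(t)
            if t.ready(cycle):
                issued.append((cycle, t.tid))
                t.stall_until = cycle + 4  # pretend each issue starts a 4-cycle memory access
                break
    return issued

# Thread 0 starts stalled, so the scheduler picks threads 1 and 2 first,
# out of the original order.
order = schedule([SIMDThread(0, stall_until=2), SIMDThread(1), SIMDThread(2)], 6)
print(order)  # [(0, 1), (1, 2), (2, 0), (4, 1), (5, 2)]
```

Note how the (cycle, thread) pairs jump between threads rather than following a fixed round robin; this is the behavior Figure 4.16 depicts, and with enough threads in the pool the stall cycles disappear entirely.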