Vector compilers could do the same tricks with mask registers as GPUs do in hardware, but
it would involve scalar instructions to save, complement, and restore mask registers. Condi-
tional execution is a case where GPUs do in runtime hardware what vector architectures do
at compile time. One optimization available at runtime for GPUs but not at compile time for
vector architectures is to skip the THEN or ELSE parts when mask bits are all zeros or all ones.
Thus, the efficiency with which GPUs execute conditional statements comes down to how
frequently the branches would diverge. For example, one calculation of eigenvalues has deep
conditional nesting, but measurements of the code show that around 82% of clock cycle issues
have between 29 and 32 out of the 32 mask bits set to one, so GPUs execute this code more
efficiently than one might expect.
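The runtime optimization described above can be sketched in plain Python (this is an illustrative simulation, not real GPU code; the function `simd_if_else` and the width constant are invented for the example). The point is that each side of the IF is executed under a mask, and a side whose mask bits are uniformly zero is skipped entirely:

```python
# Illustrative sketch of masked (predicated) SIMD execution of IF/ELSE,
# with the GPU-style runtime optimization of skipping a side whose mask
# bits are all zero. Names here are hypothetical, for exposition only.

SIMD_WIDTH = 32

def simd_if_else(cond, then_fn, else_fn, values):
    """Apply then_fn to lanes where cond holds and else_fn elsewhere,
    skipping either side entirely when its mask is uniform."""
    mask = [cond(v) for v in values]
    result = list(values)
    if any(mask):                       # skip THEN when all mask bits are 0
        for i, m in enumerate(mask):
            if m:
                result[i] = then_fn(values[i])
    if not all(mask):                   # skip ELSE when all mask bits are 1
        for i, m in enumerate(mask):
            if not m:
                result[i] = else_fn(values[i])
    return result

# abs(x): when every element is negative, the mask is all zeros and
# only the ELSE side executes; the THEN side costs nothing.
out = simd_if_else(lambda x: x >= 0,    # condition
                   lambda x: x,         # THEN: keep value
                   lambda x: -x,        # ELSE: negate
                   [-1] * SIMD_WIDTH)
```

A vector compiler using mask registers would have to issue both sides unconditionally (plus the scalar mask bookkeeping), whereas the GPU hardware can make this skip decision at runtime, cycle by cycle.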
Note that the same mechanism handles the strip-mining of vector loops—when the number
of elements doesn't perfectly match the hardware. The example at the beginning of this section
shows that an IF statement checks to see if this SIMD Lane element number (stored in R8 in the
example above) is less than the limit (i < n), and it sets masks appropriately.
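The strip-mining behavior can be sketched the same way (again a plain-Python simulation with invented names, not GPU code): the lane-index test i < n clears the mask bits of lanes past the end of the array in the final group, so those lanes simply do nothing:

```python
# Illustrative sketch of strip-mining via masks: processing n elements
# in groups of SIMD_WIDTH, where the test (i < n) masks off the lanes
# that fall past the end of the data in the last group.

SIMD_WIDTH = 32

def strip_mined_double(a):
    """Double every element of a, in SIMD-width groups."""
    n = len(a)
    out = list(a)
    for base in range(0, n, SIMD_WIDTH):
        for lane in range(SIMD_WIDTH):
            i = base + lane           # the element number held in R8 in the text
            if i < n:                 # mask bit is set only when i < n
                out[i] = 2 * a[i]
    return out
```

For n = 40, the second group of 32 lanes has only its first 8 mask bits set; the remaining 24 lanes are masked off rather than handled by separate cleanup code.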
NVIDIA GPU Memory Structures
Figure 4.18 shows the memory structures of an NVIDIA GPU. Each SIMD Lane in a multithreaded
SIMD Processor is given a private section of off-chip DRAM, which we call the Private
Memory. It is used for the stack frame, for spilling registers, and for private variables that don't
fit in the registers. SIMD Lanes do not share Private Memories. Recent GPUs cache this Private
Memory in the L1 and L2 caches to aid register spilling and to speed up function calls.