FIGURE 4.18 GPU Memory structures. GPU Memory is shared by all Grids (vectorized loops), Local Memory is shared by all threads of SIMD instructions within a thread block (body of a vectorized loop), and Private Memory is private to a single CUDA Thread.
We call the on-chip memory that is local to each multithreaded SIMD Processor Local Memory. It is shared by the SIMD Lanes within a multithreaded SIMD Processor, but this memory is not shared between multithreaded SIMD Processors. The multithreaded SIMD Processor dynamically allocates portions of the Local Memory to a thread block when it creates the thread block, and frees the memory when all the threads of the thread block exit. That portion of Local Memory is private to that thread block.
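In CUDA terminology, this on-chip Local Memory is exposed as shared memory, declared with the __shared__ qualifier. The sketch below is not taken from the text; the kernel name, array size, and reduction pattern are illustrative. It shows a per-block array that lives in Local Memory: each thread block gets its own copy when the block is created, and no other thread block can see it.

    #define TILE 256

    // Sums TILE elements per thread block; launch with blockDim.x == TILE.
    __global__ void block_sum(const float *in, float *out, int n) {
        __shared__ float tile[TILE];      // Local Memory: one copy per thread block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                  // wait for every thread in the block

        // Tree reduction that stays entirely within the block's Local Memory.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = tile[0];    // per-block result goes back to GPU Memory
    }

When the last thread of the block exits, the hardware reclaims that portion of Local Memory for a later thread block, as described above.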
Finally, we call the off-chip DRAM shared by the whole GPU and all thread blocks GPU Memory. Our vector multiply example used only GPU Memory.
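As a point of contrast, here is a minimal sketch of an element-wise vector multiply in the spirit of that example (the kernel name and the indexing scheme are illustrative, not from the text). It declares no __shared__ storage, so every read and write goes to GPU Memory.

    __global__ void vec_mul(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one CUDA Thread per element
        if (i < n)
            c[i] = a[i] * b[i];   // operands live in GPU Memory (off-chip DRAM)
    }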
The system processor, called the host, can read or write GPU Memory. Local Memory is unavailable to the host, as it is private to each multithreaded SIMD Processor. Private Memories are unavailable to the host as well.
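Concretely, the only GPU storage the host can name through the CUDA runtime API is GPU Memory: cudaMalloc returns a pointer into it and cudaMemcpy moves data in and out, while nothing analogous exists for a block's Local Memory or a thread's Private Memory, which only kernel code can touch. The host driver below is a sketch that assumes the illustrative vec_mul kernel shown earlier.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_a = (float *)malloc(bytes);          // host memory
        float *h_b = (float *)malloc(bytes);
        float *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;                       // pointers into GPU Memory
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);

        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host writes GPU Memory
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        vec_mul<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // illustrative kernel above

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // host reads GPU Memory
        printf("c[0] = %f\n", h_c[0]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }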
Rather than rely on large caches to contain the whole working sets of an application, GPUs
traditionally use smaller streaming caches and rely on extensive multithreading of threads of
SIMD instructions to hide the long latency to DRAM, since their working sets can be hundreds
of megabytes. Given the use of multithreading to hide DRAM latency, the chip area used for
caches in system processors is spent instead on computing resources and on the large number
of registers to hold the state of many threads of SIMD instructions. In contrast, as mentioned
above, vector loads and stores amortize the latency across many elements, since they only pay
the latency once and then pipeline the rest of the accesses.
 