Like vector architectures, GPUs work well only with data-level parallel problems. Both styles have gather-scatter data transfers and mask registers, and GPU processors have even more registers than do vector processors. Since they do not have a close-by scalar processor, GPUs sometimes implement a feature at runtime in hardware that vector computers implement at compile time in software. Unlike most vector architectures, GPUs also rely on multithreading within a single multi-threaded SIMD processor to hide memory latency (see Chapters 2 and 3). However, efficient code for both vector architectures and GPUs requires programmers to think in groups of SIMD operations.
A Grid is the code that runs on a GPU and consists of a set of Thread Blocks. Figure 4.12 draws the analogy between a grid and a vectorized loop and between a Thread Block and the body of that loop (after it has been strip-mined, so that it is a full computation loop). To give a concrete example, let's suppose we want to multiply two vectors together, each 8192 elements long. We'll return to this example throughout this section. Figure 4.13 shows the relationship between this example and these first two GPU terms. The GPU code that works on the whole 8192-element multiply is called a Grid (or vectorized loop). To break it down into more manageable sizes, a Grid is composed of Thread Blocks (or body of a vectorized loop), each with up to 512 elements. Note that a SIMD instruction executes 32 elements at a time. With 8192 elements in the vectors, this example thus has 16 Thread Blocks, since 16 = 8192 ÷ 512. The Grid and Thread Block are programming abstractions implemented in GPU hardware that help programmers organize their CUDA code. (The Thread Block is analogous to a strip-mined vector loop with a vector length of 32.)
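
To make the mapping concrete, here is a minimal CUDA sketch of this 8192-element multiply, assuming hypothetical array names a, b, and c and a hypothetical kernel name vecMul; each Thread Block covers 512 elements, so the launch uses a Grid of 16 Thread Blocks.

// Hypothetical element-wise vector multiply kernel (names a, b, c, vecMul are illustrative).
__global__ void vecMul(double *c, const double *a, const double *b, int n)
{
    // Each CUDA Thread computes one element; blockIdx, blockDim, and threadIdx
    // map the Grid/Thread Block structure onto element indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] * b[i];
}

// Host-side launch: a Grid of 16 Thread Blocks, each with 512 CUDA Threads.
// vecMul<<<16, 512>>>(d_c, d_a, d_b, 8192);

The guard if (i < n) is only a safeguard here, since 16 × 512 exactly covers the 8192 elements; within each Thread Block, the hardware further groups the 512 Threads into SIMD instructions that each operate on 32 elements.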