Like vector architectures, GPUs work well only with data-level parallel problems. Both styles have gather-scatter data transfers and mask registers, and GPU processors have even more registers than do vector processors. Since they do not have a close-by scalar processor, GPUs sometimes implement a feature at runtime in hardware that vector computers implement at compile time in software. Unlike most vector architectures, GPUs also rely on multithreading within a single multi-threaded SIMD processor to hide memory latency (see Chapters 2 and 3). However, efficient code for both vector architectures and GPUs requires programmers to think in groups of SIMD operations.
A Grid is the code that runs on a GPU and consists of a set of Thread Blocks. Figure 4.12 draws the analogy between a grid and a vectorized loop and between a Thread Block and the body of that loop (after it has been strip-mined, so that it is a full computation loop). To give a concrete example, let's suppose we want to multiply two vectors together, each 8192 elements long. We'll return to this example throughout this section. Figure 4.13 shows the relationship between this example and these first two GPU terms. The GPU code that works on the whole 8192-element multiply is called a Grid (or vectorized loop). To break it down into more manageable sizes, a Grid is composed of Thread Blocks (or body of a vectorized loop), each with up to 512 elements. Note that a SIMD instruction executes 32 elements at a time. With 8192 elements in the vectors, this example thus has 16 Thread Blocks, since 16 = 8192 ÷ 512. The Grid and Thread Block are programming abstractions implemented in GPU hardware that help programmers organize their CUDA code. (The Thread Block is analogous to a strip-mined vector loop with a vector length of 32.)
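
To make the mapping concrete, here is a minimal CUDA sketch of this 8192-element multiply, assuming hypothetical array names a, b, and c and a hypothetical kernel name vecMul; each Thread Block covers 512 elements, so the launch uses a Grid of 16 Thread Blocks.

// Hypothetical element-wise vector multiply kernel (names a, b, c, vecMul are illustrative).
__global__ void vecMul(double *c, const double *a, const double *b, int n)
{
    // Each CUDA Thread computes one element; blockIdx, blockDim, and threadIdx
    // map the Grid/Thread Block structure onto element indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] * b[i];
}

// Host-side launch: a Grid of 16 Thread Blocks, each with 512 CUDA Threads.
// vecMul<<<16, 512>>>(d_c, d_a, d_b, 8192);

The guard if (i < n) is only a safeguard here, since 16 × 512 exactly covers the 8192 elements; within each Thread Block, the hardware further groups the 512 Threads into SIMD instructions that each operate on 32 elements.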