The challenge for the GPU programmer is not simply getting good performance on the GPU,
but also coordinating the scheduling of computation on the system processor and the GPU
and the transfer of data between system memory and GPU memory. Moreover, as we shall
see later in this section, GPUs have virtually every type of parallelism that can be captured by
the programming environment: multithreading, MIMD, SIMD, and even instruction-level.
NVIDIA decided to develop a C-like language and programming environment that would
improve the productivity of GPU programmers by attacking both the challenges of heterogeneous
computing and of multifaceted parallelism. The name of their system is CUDA, for
Compute Unified Device Architecture. CUDA produces C/C++ for the system processor (host)
and a C and C++ dialect for the GPU (device, hence the D in CUDA). A similar programming
language is OpenCL, which several companies are developing to offer a vendor-independent
language for multiple platforms.
NVIDIA decided that the unifying theme of all these forms of parallelism is the CUDA
Thread . Using this lowest level of parallelism as the programming primitive, the compiler and
the hardware can gang thousands of CUDA Threads together to utilize the various styles
of parallelism within a GPU: multithreading, MIMD, SIMD, and instruction-level parallelism.
Hence, NVIDIA classifies the CUDA programming model as Single Instruction, Multiple
Thread ( SIMT ). For reasons we shall soon see, these threads are blocked together and executed
in groups of 32 threads, called a Thread Block . We call the hardware that executes a whole block
of threads a multithreaded SIMD Processor .
We need just a few details before we can give an example of a CUDA program:
■ To distinguish between functions for the GPU (device) and functions for the system
processor (host), CUDA uses __device__ or __global__ for the former and __host__ for the latter.
■ CUDA variables declared in the __device__ or __global__ functions are allocated to the GPU
Memory (see below), which is accessible by all multithreaded SIMD Processors.
■ The extended function call syntax for the function name that runs on the GPU is
name <<<dimGrid, dimBlock>>>(… parameter list …)
where dimGrid and dimBlock specify the dimensions of the code (in blocks) and the
dimensions of a block (in threads).
■ In addition to the identifier for blocks ( blockIdx ) and the identifier for threads per block
( threadIdx ), CUDA provides a keyword for the number of threads per block ( blockDim ),
which comes from the dimBlock parameter in the bullet above.
Before seeing the CUDA code, let's start with conventional C code for the DAXPY loop from
Section 4.2:
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
Below is the CUDA version. We launch n threads, one per vector element, with 256 CUDA
Threads per thread block in a multithreaded SIMD Processor. The GPU function starts by
calculating the corresponding element index i based on the block ID, the number of threads per
block, and the thread ID. As long as this index is within the array (i < n), it performs the
multiply and add.
// Invoke DAXPY with 256 threads per Thread Block
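// Host code (a sketch following the description above): launch enough
// 256-thread Thread Blocks to cover all n elements.
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
    // Each CUDA Thread computes its element index from its block ID,
    // the number of threads per block, and its thread ID.
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    // The last Thread Block may extend past n, so guard the access.
    if (i < n) y[i] = a*x[i] + y[i];
}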