The challenge for the GPU programmer is not simply getting good performance on the GPU,
but also coordinating the scheduling of computation on the system processor and the GPU
and the transfer of data between system memory and GPU memory. Moreover, as we shall
see later in this section, GPUs have virtually every type of parallelism that can be captured by
the programming environment: multithreading, MIMD, SIMD, and even instruction-level.
NVIDIA decided to develop a C-like language and programming environment that would
improve the productivity of GPU programmers by attacking both the challenges of heterogeneous
computing and of multifaceted parallelism. The name of their system is CUDA, for
Compute Unified Device Architecture. CUDA produces C/C++ for the system processor (host)
and a C and C++ dialect for the GPU (device, hence the D in CUDA). A similar programming
language is OpenCL, which several companies are developing to offer a vendor-independent
language for multiple platforms.
NVIDIA decided that the unifying theme of all these forms of parallelism is the CUDA
Thread . Using this lowest level of parallelism as the programming primitive, the compiler and
the hardware can gang thousands of CUDA Threads together to utilize the various styles
of parallelism within a GPU: multithreading, MIMD, SIMD, and instruction-level parallelism.
Hence, NVIDIA classifies the CUDA programming model as Single Instruction, Multiple
Thread ( SIMT ). For reasons we shall soon see, these threads are blocked together and executed
in groups of 32 threads, called a Thread Block . We call the hardware that executes a whole block
of threads a multithreaded SIMD Processor .
We need just a few details before we can give an example of a CUDA program:
■ To distinguish between functions for the GPU (device) and functions for the system
processor (host), CUDA uses __device__ or __global__ for the former and __host__ for the latter.
■ CUDA variables declared in the __device__ or __global__ functions are allocated to the GPU
Memory (see below), which is accessible by all multithreaded SIMD Processors.
■ The extended function call syntax for the function name that runs on the GPU is
name <<<dimGrid, dimBlock>>>(… parameter list …)
where dimGrid and dimBlock specify the dimensions of the code (in blocks) and the
dimensions of a block (in threads).
■ In addition to the identifier for blocks ( blockIdx ) and the identifier for threads per block
( threadIdx ), CUDA provides a keyword for the number of threads per block ( blockDim ),
which comes from the dimBlock parameter in the bullet above.
Before seeing the CUDA code, let's start with conventional C code for the DAXPY loop from
Section 4.2:
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
Below is the CUDA version. We launch n threads, one per vector element, with 256 CUDA
Threads per thread block in a multithreaded SIMD Processor. The GPU function starts by
calculating the corresponding element index i based on the block ID, the number of threads per
block, and the thread ID. As long as this index is within the array (i < n), it performs the
multiply and add.
// Invoke DAXPY with 256 threads per Thread Block
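// Host code (a sketch following the description above): launch enough
// 256-thread Thread Blocks to cover all n elements.
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
    // Each CUDA Thread computes its element index from its block ID,
    // the number of threads per block, and its thread ID.
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    // The last Thread Block may extend past n, so guard the access.
    if (i < n) y[i] = a*x[i] + y[i];
}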