// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
Comparing the C and CUDA codes, we see a common pattern for parallelizing data-parallel CUDA code. The C version has a loop where each iteration is independent of the others, allowing the loop to be transformed straightforwardly into a parallel code where each loop iteration becomes an independent thread. (As mentioned above and described in detail in Section 4.5, vectorizing compilers also rely on a lack of dependences between iterations of a loop, which are called loop-carried dependences.) The programmer determines the parallelism in CUDA explicitly by specifying the grid dimensions and the number of threads per SIMD Processor. By assigning a single thread to each element, there is no need to synchronize among threads when writing results to memory.
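For comparison, a conventional serial C version of DAXPY might look like the sketch below; it is not reproduced from the text, but it uses the same argument names as the CUDA kernel above and makes the independence of the loop iterations visible.

// Serial DAXPY in C: every iteration of the loop is independent of the
// others, so each iteration can become its own CUDA thread.
void daxpy_serial(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}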
The GPU hardware handles parallel execution and thread management; it is not done by applications or by the operating system. To simplify scheduling by the hardware, CUDA requires that thread blocks be able to execute independently and in any order. Different thread blocks cannot communicate directly, although they can coordinate using atomic memory operations in Global Memory.
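As an illustration of the kind of coordination that remains available, the sketch below accumulates a sum across all thread blocks through an atomicAdd on a single location in Global Memory; the kernel name and the single-accumulator pattern are assumptions made for this example rather than something taken from the text.

// Each thread adds its element into one accumulator in Global Memory.
// Thread blocks never communicate directly; the atomic operation is the
// only point of coordination, so the blocks may execute in any order.
// (atomicAdd on double requires a GPU of compute capability 6.0 or higher.)
__global__
void sum_atomic(int n, const double *x, double *total)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, x[i]);
}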
As we shall soon see, many GPU hardware concepts are not obvious in CUDA. That is a
good thing from a programmer productivity perspective, but most programmers are using
GPUs instead of CPUs to get performance. Performance programmers must keep the GPU
hardware in mind when writing in CUDA. For reasons explained shortly, they know that they
need to keep groups of 32 threads together in control flow to get the best performance from
multithreaded SIMD Processors, and create many more threads per multithreaded SIMD Pro-
cessor to hide latency to DRAM. They also need to keep the data addresses localized in one or
a few blocks of memory to get the expected memory performance.
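The point about localized data addresses can be made concrete with two versions of the same copy kernel: in the first, consecutive threads touch consecutive addresses, which the memory system can service with a few wide blocks; in the second, a large stride scatters the addresses across many blocks. The kernel names and the stride parameter are illustrative assumptions, not taken from the text.

// Unit-stride copy: consecutive threads read consecutive elements, so the
// addresses fall in one or a few contiguous blocks of memory.
__global__
void copy_unit_stride(int n, const double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i];
}

// Strided copy: consecutive threads read elements far apart, so each access
// may touch a different block of memory and effective bandwidth drops.
__global__
void copy_strided(int n, int stride, const double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    long j = (long)i * stride;
    if (j < n) y[i] = x[j];
}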
As in many parallel systems, a compromise between productivity and performance is for CUDA to include intrinsics that give programmers explicit control of the hardware. The struggle between productivity on one hand versus allowing the programmer to express anything that the hardware can do on the other happens often in parallel computing. It will be interesting to see how the language evolves in this classic productivity-performance battle, as well as to see if CUDA becomes popular for other GPUs or even other architectural styles.
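One example of such an intrinsic is the warp shuffle family, which lets the 32 threads of a group exchange register values directly; the reduction below is a sketch assuming CUDA 9 or later for __shfl_down_sync, and the helper name warp_sum is made up for this example.

// Sum a value across the 32 threads of a warp with the __shfl_down_sync
// intrinsic, exposing the 32-thread hardware grouping directly to software.
__device__
double warp_sum(double v)
{
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;    // lane 0 of the warp now holds the complete sum
}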
NVIDIA GPU Computational Structures
The uncommon heritage mentioned above helps explain why GPUs have their own architectural style and their own terminology independent from CPUs. One obstacle to understanding GPUs has been the jargon, with some terms even having misleading names. This obstacle has been surprisingly difficult to overcome, as the many rewrites of this chapter can attest. To try to bridge the twin goals of making the architecture of GPUs understandable and learning the many GPU terms with nontraditional definitions, our final solution is to use the CUDA terminology for software but initially use more descriptive terms for the hardware, sometimes