If we think of arithmetic intensity as a pole that hits the roof, it either hits the flat part, meaning performance is computationally limited, or it hits the slanted part, meaning performance is ultimately limited by memory bandwidth. The vertical dashed line on the right (arithmetic intensity of 4) is an example of the former, and the vertical dashed line on the left (arithmetic intensity of 1/4) is an example of the latter.
Given a Roofline model of a computer, you can apply it repeatedly, since it doesn't vary by
kernel.
Note that the “ridge point,” where the diagonal and horizontal roofs meet, offers an interesting insight into the computer. If it is far to the right, then only kernels with very high arithmetic intensity can achieve the maximum performance of that computer. If it is far to the left,
then almost any kernel can potentially hit the maximum performance. As we shall see, this
vector processor has both much higher memory bandwidth and a ridge point far to the left
when compared to other SIMD processors.
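Stated as a formula (our restatement of the Roofline bound), attainable performance is the minimum of the two roofs:

\[
\text{Attainable GFLOP/sec} = \min\bigl(\text{Peak Memory BW} \times \text{Arithmetic Intensity},\ \text{Peak Floating-Point Performance}\bigr)
\]

The ridge point is simply the arithmetic intensity at which the two arguments of the min are equal:

\[
\text{Ridge point} = \frac{\text{Peak Floating-Point Performance}}{\text{Peak Memory BW}}
\]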
Figure 4.11 shows that the peak computational performance of the SX-9 is 2.4× that of the Core i7, while its peak memory bandwidth is 10× higher. For programs with an arithmetic intensity of 0.25, the SX-9 is 10× faster (40.5 versus 4.1 GFLOP/sec). The higher memory bandwidth moves the ridge point from 2.6 on the Core i7 to 0.6 on the SX-9, which means many more programs can reach peak computational performance on the vector processor.
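As a consistency check (derived here from the numbers above, not quoted from the figure), an arithmetic intensity of 0.25 lies to the left of both ridge points, so both machines are on the slanted, bandwidth-limited part of their roofs, and the formula lets us back out the bandwidths:

\[
\text{BW}_{\text{SX-9}} = \frac{40.5\ \text{GFLOP/sec}}{0.25\ \text{FLOP/byte}} = 162\ \text{GB/sec},
\qquad
\text{BW}_{\text{Core i7}} = \frac{4.1}{0.25} = 16.4\ \text{GB/sec}
\]

The ratio 162/16.4 ≈ 10 matches the memory bandwidth advantage stated above.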
4.4 Graphics Processing Units
For a few hundred dollars, anyone can buy a GPU with hundreds of parallel floating-point units, which makes high-performance computing more accessible. The interest in GPU computing blossomed when this potential was combined with a programming language that made GPUs easier to program. Hence, many programmers of scientific and multimedia applications today are pondering whether to use GPUs or CPUs.
GPUs and CPUs do not go back in computer architecture genealogy to a common ancestor;
there is no Missing Link that explains both. As Section 4.10 describes, the primary ancestors
of GPUs are graphics accelerators, as doing graphics well is the reason why GPUs exist. While
GPUs are moving toward mainstream computing, they can't abandon their responsibility to
continue to excel at graphics. Thus, the design of GPUs may make more sense when architects
ask, given the hardware invested to do graphics well, how can we supplement it to improve
the performance of a wider range of applications?
Note that this section concentrates on using GPUs for computing. To see how GPU computing combines with the traditional role of graphics acceleration, see “Graphics and Computing GPUs,” by John Nickolls and David Kirk (Appendix A in the 4th edition of Computer Organization and Design by the same authors as this book).
Since the terminology and some hardware features are quite different from vector and SIMD
architectures, we believe it will be easier if we start with the simplified programming model
for GPUs before we describe the architecture.
Programming the GPU
CUDA is an elegant solution to the problem of representing parallelism in algorithms, not all algorithms, but enough to matter. It seems to resonate in some way with the way we think and code, allowing an easier, more natural expression of parallelism beyond the task level.
Vincent Natoli
“Kudos for CUDA,” HPC Wire (2010)
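As a preview of that programming model, here is a minimal CUDA sketch of the DAXPY kernel (Y = a×X + Y). The use of unified memory and the block size of 256 threads are illustrative choices for this sketch, not requirements of CUDA:

#include <cstdio>
#include <cuda_runtime.h>

// DAXPY kernel: each CUDA thread computes one element of y = a*x + y.
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                     // guard: n need not be a multiple of the block size
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    double *x, *y;
    // Unified memory keeps the sketch short; explicit cudaMalloc/cudaMemcpy also works.
    cudaMallocManaged(&x, n * sizeof(double));
    cudaMallocManaged(&y, n * sizeof(double));
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    // Launch enough 256-thread blocks to cover all n elements.
    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f (expect 4.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}

The key idea the rest of this section builds on is visible even in this sketch: the programmer writes the code for one thread and specifies how many threads to launch; the hardware, not the software, schedules them onto the parallel floating-point units.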
 