If we think of arithmetic intensity as a pole that hits the roof, it either hits the flat part, meaning performance is computationally limited, or it hits the slanted part, meaning performance is ultimately limited by memory bandwidth. The vertical dashed line on the right (arithmetic intensity of 4) is an example of the former, and the vertical dashed line on the left (arithmetic intensity of 1/4) is an example of the latter.
Given a Roofline model of a computer, you can apply it repeatedly, since it doesn't vary by
kernel.
Note that the “ridge point,” where the diagonal and horizontal roofs meet, offers an interesting insight into the computer. If it is far to the right, then only kernels with very high arithmetic intensity can achieve the maximum performance of that computer. If it is far to the left,
then almost any kernel can potentially hit the maximum performance. As we shall see, this
vector processor has both much higher memory bandwidth and a ridge point far to the left
when compared to other SIMD processors.
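Stated as a formula (our restatement of the Roofline bound), attainable performance is the minimum of the two roofs:

\[
\text{Attainable GFLOP/sec} = \min\bigl(\text{Peak Memory BW} \times \text{Arithmetic Intensity},\ \text{Peak Floating-Point Performance}\bigr)
\]

The ridge point is simply the arithmetic intensity at which the two arguments of the min are equal:

\[
\text{Ridge point} = \frac{\text{Peak Floating-Point Performance}}{\text{Peak Memory BW}}
\]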
Figure 4.11 shows that the peak computational performance of the SX-9 is 2.4× that of the Core i7, while its peak memory bandwidth is 10× higher. For programs with an arithmetic intensity of 0.25, the SX-9 is 10× faster (40.5 versus 4.1 GFLOP/sec). The higher memory bandwidth moves the ridge point from 2.6 on the Core i7 to 0.6 on the SX-9, which means many more programs can reach peak computational performance on the vector processor.
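As a consistency check (derived here from the numbers above, not quoted from the figure), an arithmetic intensity of 0.25 lies to the left of both ridge points, so both machines are on the slanted, bandwidth-limited part of their roofs, and the formula lets us back out the bandwidths:

\[
\text{BW}_{\text{SX-9}} = \frac{40.5\ \text{GFLOP/sec}}{0.25\ \text{FLOP/byte}} = 162\ \text{GB/sec},
\qquad
\text{BW}_{\text{Core i7}} = \frac{4.1}{0.25} = 16.4\ \text{GB/sec}
\]

The ratio 162/16.4 ≈ 10 matches the memory bandwidth advantage stated above.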
4.4 Graphics Processing Units
For a few hundred dollars, anyone can buy a GPU with hundreds of parallel floating-point units, which makes high-performance computing more accessible. The interest in GPU computing blossomed when this potential was combined with a programming language that made GPUs easier to program. Hence, many programmers of scientific and multimedia applications today are pondering whether to use GPUs or CPUs.
GPUs and CPUs do not go back in computer architecture genealogy to a common ancestor;
there is no Missing Link that explains both. As Section 4.10 describes, the primary ancestors
of GPUs are graphics accelerators, as doing graphics well is the reason why GPUs exist. While
GPUs are moving toward mainstream computing, they can't abandon their responsibility to
continue to excel at graphics. Thus, the design of GPUs may make more sense when architects
ask, given the hardware invested to do graphics well, how can we supplement it to improve
the performance of a wider range of applications?
Note that this section concentrates on using GPUs for computing. To see how GPU computing combines with the traditional role of graphics acceleration, see “Graphics and Computing GPUs,” by John Nickolls and David Kirk (Appendix A in the 4th edition of Computer Organization and Design by the same authors as this book).
Since the terminology and some hardware features are quite different from vector and SIMD
architectures, we believe it will be easier if we start with the simplified programming model
for GPUs before we describe the architecture.
Programming the GPU
CUDA is an elegant solution to the problem of representing parallelism in algorithms, not all algorithms, but enough to matter. It seems to resonate in some way with the way we think and code, allowing an easier, more natural expression of parallelism beyond the task level.
Vincent Natoli
“Kudos for CUDA,” HPC Wire (2010)
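As a preview of that programming model, here is a minimal CUDA sketch of the DAXPY kernel (Y = a×X + Y). The use of unified memory and the block size of 256 threads are illustrative choices for this sketch, not requirements of CUDA:

#include <cstdio>
#include <cuda_runtime.h>

// DAXPY kernel: each CUDA thread computes one element of y = a*x + y.
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                     // guard: n need not be a multiple of the block size
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    double *x, *y;
    // Unified memory keeps the sketch short; explicit cudaMalloc/cudaMemcpy also works.
    cudaMallocManaged(&x, n * sizeof(double));
    cudaMallocManaged(&y, n * sizeof(double));
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    // Launch enough 256-thread blocks to cover all n elements.
    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f (expect 4.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}

The key idea the rest of this section builds on is visible even in this sketch: the programmer writes the code for one thread and specifies how many threads to launch; the hardware, not the software, schedules them onto the parallel floating-point units.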
 