Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach

instructions automatically. For example, advanced compilers today can generate SIMD

floating-point instructions to deliver much higher performance for scientific codes. However,

programmers must be sure to align all the data in memory to the width of the SIMD unit on

which the code is run to prevent the compiler from generating scalar instructions for

otherwise vectorizable code.

The Roofline Visual Performance Model

One visual, intuitive way to compare potential floating-point performance of variations of

SIMD architectures is the Roofline model [Williams et al. 2009]. It ties together floating-point

performance, memory performance, and arithmetic intensity in a two-dimensional graph.

Arithmetic intensity is the ratio of floating-point operations per byte of memory accessed. It can

be calculated by taking the total number of floating-point operations for a program divided by

the total number of data bytes transferred to main memory during program execution. Figure

4.10 shows the relative arithmetic intensity of several example kernels.

FIGURE 4.10 Arithmetic intensity, specified as the number of floating-point operations

to run the program divided by the number of bytes accessed in main memory [Williams

et al. 2009]. Some kernels have an arithmetic intensity that scales with problem size, such as

dense matrix, but there are many kernels with arithmetic intensities independent of problem

size.
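As a rough illustration of the definition above, the sketch below computes arithmetic intensity for two kernels. The function names and the traffic counts are assumptions made for this example, not figures from the text: DAXPY is assumed to read x and y and write y with no cache reuse, and the dense matrix multiply is assumed to move each matrix across the memory bus exactly once, which is an idealized best case.

```python
def arithmetic_intensity(total_flops, total_bytes):
    """FLOPs per byte of main-memory traffic."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_flops / total_bytes


def daxpy_intensity(n, bytes_per_element=8):
    # y[i] = a * x[i] + y[i]: one multiply and one add per element.
    flops = 2 * n
    # Assumed traffic: read x[i], read y[i], write y[i]; no cache reuse.
    bytes_moved = 3 * n * bytes_per_element
    return arithmetic_intensity(flops, bytes_moved)


def dense_matmul_intensity(n, bytes_per_element=8):
    # C = A * B for n-by-n matrices: n multiplies and n adds per output element.
    flops = 2 * n ** 3
    # Idealized assumption: each of the three matrices is transferred once.
    bytes_moved = 3 * n ** 2 * bytes_per_element
    return arithmetic_intensity(flops, bytes_moved)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, daxpy_intensity(n), dense_matmul_intensity(n))
```

Under these assumptions the DAXPY intensity stays at 1/12 FLOP per byte for every n, while the dense matrix intensity grows as n/12, which matches the distinction the caption draws between kernels whose intensity is independent of problem size and those whose intensity scales with it.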

Peak floating-point performance can be found using the hardware specifications. Many of

the kernels in this case study do not fit in on-chip caches, so peak memory performance is

defined by the memory system behind the caches. Note that we need the peak memory bandwidth

that is available to the processors, not just at the DRAM pins as in Figure 4.27 on page

325. One way to find the (delivered) peak memory performance is to run the Stream benchmark.
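The idea of measuring delivered rather than theoretical bandwidth can be sketched with a simple timed copy. This is not the Stream benchmark itself, which is a compiled program with several kernels; it is only a minimal stand-in that mimics a copy kernel using the Python standard library. The function name, buffer size, and repeat count are assumptions chosen for illustration, and the buffer should be much larger than the last-level cache for the result to reflect main memory.

```python
import time


def copy_bandwidth_gb_per_s(buffer_bytes=64 * 1024 * 1024, repeats=5):
    """Rough sustained copy bandwidth in GB/s (best of `repeats` runs)."""
    src = bytearray(buffer_bytes)
    dst = bytearray(buffer_bytes)
    best_ns = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        dst[:] = src  # bulk copy, analogous to a copy kernel a[i] = b[i]
        elapsed = time.perf_counter_ns() - start
        if elapsed > 0 and (best_ns is None or elapsed < best_ns):
            best_ns = elapsed
    if best_ns is None:
        raise RuntimeError("timer resolution too coarse for this buffer size")
    # Each copy reads buffer_bytes and writes buffer_bytes.
    bytes_moved = 2 * buffer_bytes
    # Bytes per nanosecond is numerically equal to GB/s (10**9 bytes per second).
    return bytes_moved / best_ns


if __name__ == "__main__":
    print(f"approx. copy bandwidth: {copy_bandwidth_gb_per_s():.1f} GB/s")
```

A single-threaded copy like this will usually understate what all cores together can draw from memory, so the number is best read as a lower bound on the bandwidth ceiling rather than a substitute for a proper Stream run.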

Figure 4.11 shows the Roofline model for the NEC SX-9 vector processor on the left and the

Intel Core i7 920 multicore computer on the right. The vertical Y-axis is achievable floating-point

performance from 2 to 256 GFLOP/sec. The horizontal X-axis is arithmetic intensity, varying

from 1/8 FLOP/DRAM byte accessed to 16 FLOP/DRAM byte accessed in both graphs.

Note that the graph uses a log-log scale, and that a Roofline is drawn just once for a computer, since it depends only on the hardware.
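The shape of such a graph can be sketched numerically. The bound used below, the minimum of peak floating-point performance and peak memory bandwidth times arithmetic intensity, is the standard Roofline formulation from Williams et al.; it is not spelled out in the passage above. The peak values of 40 GFLOP/sec and 16 GB/sec are hypothetical placeholders, not the figures for the SX-9 or the Core i7 920, and the function names are chosen for this example only.

```python
def roofline_gflops(peak_gflops, peak_bw_gb_per_s, intensity):
    """Attainable GFLOP/sec at a given arithmetic intensity (FLOP/byte)."""
    # Memory-bound region: bandwidth * intensity; compute-bound region: flat peak.
    return min(peak_gflops, peak_bw_gb_per_s * intensity)


def ridge_point(peak_gflops, peak_bw_gb_per_s):
    """Arithmetic intensity at which the sloped and flat parts of the roof meet."""
    return peak_gflops / peak_bw_gb_per_s


if __name__ == "__main__":
    PEAK_GFLOPS = 40.0  # hypothetical peak floating-point rate
    PEAK_BW = 16.0      # hypothetical delivered bandwidth in GB/sec

    # Powers of two from 1/8 to 16 FLOP/byte, matching the log-scaled X-axis.
    for k in range(-3, 5):
        ai = 2.0 ** k
        print(f"{ai:7.3f} FLOP/byte -> "
              f"{roofline_gflops(PEAK_GFLOPS, PEAK_BW, ai):6.1f} GFLOP/sec")
    print("ridge point:", ridge_point(PEAK_GFLOPS, PEAK_BW), "FLOP/byte")
```

Because both inputs are properties of the machine, the same curve serves every kernel: a kernel is placed on it simply by looking up its arithmetic intensity on the X-axis.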
