instructions automatically. For example, advanced compilers today can generate SIMD
floating-point instructions to deliver much higher performance for scientific codes. However,
programmers must be sure to align all the data in memory to the width of the SIMD unit on
which the code is run to prevent the compiler from generating scalar instructions for otherwise
vectorizable code.
The Roofline Visual Performance Model
One visual, intuitive way to compare the potential floating-point performance of variations of
SIMD architectures is the Roofline model [Williams et al. 2009]. It ties together floating-point
performance, memory performance, and arithmetic intensity in a two-dimensional graph.
Arithmetic intensity is the ratio of floating-point operations to bytes of memory accessed. It can
be calculated by taking the total number of floating-point operations for a program and dividing by
the total number of data bytes transferred to main memory during program execution. Figure
4.10 shows the relative arithmetic intensity of several example kernels.
FIGURE 4.10 Arithmetic intensity, specified as the number of floating-point operations
to run the program divided by the number of bytes accessed in main memory [Williams
et al. 2009]. Some kernels have an arithmetic intensity that scales with problem size, such as
dense matrix, but there are many kernels with arithmetic intensities independent of problem
size.
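The definition above can be made concrete with a small sketch. The two kernels below are illustrative choices, not the ones plotted in Figure 4.10, and the traffic counts are assumptions: 8-byte doubles, each array crossing the memory bus exactly once, and no write-allocate traffic. The function names are hypothetical.

```python
BYTES_PER_DOUBLE = 8  # assumption: double-precision operands


def arithmetic_intensity(flops, bytes_moved):
    """Floating-point operations per byte of main-memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def triad_intensity(n):
    """a[i] = b[i] + s * c[i] over n elements."""
    flops = 2 * n                          # one multiply and one add per element
    bytes_moved = 3 * n * BYTES_PER_DOUBLE  # read b and c, write a
    return arithmetic_intensity(flops, bytes_moved)


def dense_matmul_intensity(n):
    """C = A * B for n x n matrices, with idealized caching."""
    flops = 2 * n ** 3                          # n^3 multiply-add pairs
    bytes_moved = 3 * n * n * BYTES_PER_DOUBLE  # A, B, and C each moved once
    return arithmetic_intensity(flops, bytes_moved)


if __name__ == "__main__":
    for n in (120, 240, 480):
        print(n, triad_intensity(n), dense_matmul_intensity(n))
```

Under these assumptions the triad stays at 1/12 FLOP per byte regardless of n, while the dense matrix multiply grows as n/12, which matches the caption's distinction between kernels whose intensity is independent of problem size and kernels whose intensity scales with it.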
Peak floating-point performance can be found using the hardware specifications. Many of
the kernels in this case study do not fit in on-chip caches, so peak memory performance is
defined by the memory system behind the caches. Note that we need the peak memory bandwidth
that is available to the processors, not just at the DRAM pins as in Figure 4.27 on page
325. One way to find the (delivered) peak memory performance is to run the Stream benchmark.
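To illustrate what "delivered" bandwidth means, the sketch below times a bulk memory copy and reports bytes moved per second. It is only a rough, single-threaded stand-in for one Stream-style copy kernel, not the Stream benchmark itself; the function name, buffer size, and repeat count are assumptions chosen for illustration, and the result will generally understate what a tuned multithreaded run achieves.

```python
import time


def copy_bandwidth_gbytes_per_sec(n_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate delivered copy bandwidth in GB/s (hypothetical helper)."""
    # Fill the source with nonzero bytes so its pages are really backed by memory.
    src = bytearray(b"\x01") * n_bytes
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # one bulk copy: reads n_bytes and writes n_bytes
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    bytes_moved = 2 * n_bytes  # count both the read and the write stream
    # Guard against a zero reading from a coarse timer.
    return bytes_moved / max(best, 1e-12) / 1e9


if __name__ == "__main__":
    print(f"delivered copy bandwidth: {copy_bandwidth_gbytes_per_sec():.1f} GB/s")
```

The buffers should be much larger than the last-level cache; otherwise the measurement reflects cache bandwidth rather than the memory system behind the caches. Taking the best of several repeats discards the first pass, which pays for page faults on the destination buffer.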
Figure 4.11 shows the Roofline model for the NEC SX-9 vector processor on the left and the
Intel Core i7 920 multicore computer on the right. The vertical Y-axis is achievable floating-point
performance from 2 to 256 GFLOP/sec. The horizontal X-axis is arithmetic intensity, varying
from 1/8 FLOP/DRAM byte accessed to 16 FLOP/DRAM byte accessed in both graphs.
Note that the graph is a log-log scale, and that a Roofline is constructed just once for a given computer.
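A Roofline can be tabulated with a few lines of code. The sketch below uses the commonly stated form of the model, in which attainable performance is the smaller of peak floating-point performance and peak memory bandwidth times arithmetic intensity. The machine parameters are hypothetical placeholders, not the SX-9 or Core i7 920 figures, and the function name is an assumption.

```python
def attainable_gflops(peak_gflops, peak_gbytes_per_sec, intensity):
    """Roofline bound: min(compute roof, memory bandwidth * intensity)."""
    return min(peak_gflops, peak_gbytes_per_sec * intensity)


# Hypothetical machine, chosen only to illustrate the shape of the curve.
PEAK_GFLOPS = 100.0         # from hardware specifications (placeholder value)
PEAK_GBYTES_PER_SEC = 20.0  # delivered memory bandwidth (placeholder value)

# Arithmetic intensities from 1/8 to 16 FLOP/byte in powers of two,
# matching the X-axis range described for Figure 4.11.
INTENSITIES = [2.0 ** k for k in range(-3, 5)]

if __name__ == "__main__":
    for ai in INTENSITIES:
        bound = attainable_gflops(PEAK_GFLOPS, PEAK_GBYTES_PER_SEC, ai)
        print(f"{ai:7.3f} FLOP/byte -> {bound:6.1f} GFLOP/sec")
```

Because both axes are logarithmic, the bandwidth-limited part of this table plots as a straight diagonal line and the compute-limited part as a flat line, which gives the model its name. Since the inputs are properties of the machine rather than of any kernel, the same table serves every kernel run on that computer.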
 