Hardware Reference
In-Depth Information
FIGURE 4.11 Roofline model for one NEC SX-9 vector processor on the left and the
Intel Core i7 920 multicore computer with SIMD Extensions on the right [Williams et al.
2009]. This Roofline is for unit-stride memory accesses and double-precision floating-point
performance. The NEC SX-9 is a vector supercomputer announced in 2008 that costs millions
of dollars. It has a peak DP FP performance of 102.4 GFLOP/sec and a peak memory
bandwidth of 162 GBytes/sec from the Stream benchmark. The Core i7 920 has a peak DP FP
performance of 42.66 GFLOP/sec and a peak memory bandwidth of 16.4 GBytes/sec. The
dashed vertical lines at an arithmetic intensity of 4 FLOP/byte show that both processors
operate at peak performance. In this case, the SX-9 at 102.4 GFLOP/sec is 2.4× faster than
the Core i7 at 42.66 GFLOP/sec. At an arithmetic intensity of 0.25 FLOP/byte, the SX-9 is
10× faster at 40.5 GFLOP/sec versus 4.1 GFLOP/sec for the Core i7.
For a given kernel, we can find a point on the X-axis based on its arithmetic intensity. If we
drew a vertical line through that point, the performance of the kernel on that computer must
lie somewhere along that line. We can plot a horizontal line showing peak floating-point
performance of the computer. Obviously, the actual floating-point performance can be no
higher than the horizontal line, since that is a hardware limit.
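As a rough illustration of locating a kernel on the X-axis, the sketch below computes the arithmetic intensity of a DAXPY-style loop, y[i] = a * x[i] + y[i]. The choice of kernel, the function name, and the assumption that each operand travels to or from memory exactly once (no cache reuse) are illustrative and are not taken from the figure.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Return floating-point operations per byte of memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


# Assumed kernel: y[i] = a * x[i] + y[i] in double precision.
FLOPS_PER_ELEMENT = 2        # one multiply and one add
BYTES_PER_ELEMENT = 3 * 8    # load x[i], load y[i], store y[i]; 8 bytes each

daxpy_ai = arithmetic_intensity(FLOPS_PER_ELEMENT, BYTES_PER_ELEMENT)
print(f"DAXPY arithmetic intensity: {daxpy_ai:.4f} FLOP/byte")
```

Under these assumptions the kernel sits at about 0.083 FLOP/byte, well to the left of both dashed lines in the figure. A kernel that reuses data from cache would move fewer bytes and land further to the right.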
How could we plot the peak memory performance? Since the X-axis is FLOP/byte and the
Y-axis is FLOP/sec, bytes/sec is just a diagonal line at a 45-degree angle in this figure. Hence,
we can plot a third line that gives the maximum floating-point performance that the memory
system of that computer can support for a given arithmetic intensity. We can express the limits
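as a formula to plot these lines in the graphs in Figure 4.11:

Attainable GFLOP/sec = min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)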
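The two limits can be sketched in a few lines of Python by taking attainable performance as the lower of the compute ceiling and the bandwidth ceiling. The peak figures come from the caption of Figure 4.11; the function and variable names are hypothetical, chosen only for this illustration.

```python
def attainable_gflops(peak_gflops: float, peak_bw_gbytes: float,
                      intensity: float) -> float:
    """Upper bound on GFLOP/sec at a given arithmetic intensity (FLOP/byte)."""
    memory_ceiling = peak_bw_gbytes * intensity   # slanted part of the roof
    return min(peak_gflops, memory_ceiling)       # flat part caps the result


# (peak DP GFLOP/sec, peak memory bandwidth in GBytes/sec), from the caption.
MACHINES = {
    "NEC SX-9": (102.4, 162.0),
    "Core i7 920": (42.66, 16.4),
}

for intensity in (0.25, 4.0):
    for name, (peak, bandwidth) in MACHINES.items():
        bound = attainable_gflops(peak, bandwidth, intensity)
        print(f"{name:12s} at {intensity:4.2f} FLOP/byte: {bound:6.2f} GFLOP/sec")
```

At 0.25 FLOP/byte the bandwidth term is the smaller one on both machines (40.5 versus 4.1 GFLOP/sec), while at 4 FLOP/byte both machines are capped by their peak floating-point rates, matching the numbers quoted in the caption.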
The horizontal and diagonal lines give this simple model its name and indicate its value.
The “Roofline” sets an upper bound on performance of a kernel depending on its arithmetic
intensity. If we think of arithmetic intensity as a pole that hits the roof, either it hits the flat part
of the roof, which means performance is computationally limited, or it hits the slanted part
of the roof, which means performance is ultimately limited by memory bandwidth. In Figure
4.11, the vertical dashed line on the right (arithmetic intensity of 4) is an example of the former
 