Pitfall Increasing Vector Performance, Without Comparable Increases In Scalar
Performance
This imbalance was a problem on many early vector processors, and a place where Seymour
Cray (the architect of the Cray computers) rewrote the rules. Many of the early vector pro-
cessors had comparatively slow scalar units (as well as large start-up overheads). Even today,
a processor with lower vector performance but better scalar performance can outperform a
processor with higher peak vector performance. Good scalar performance keeps down over-
head costs (strip mining, for example) and reduces the impact of Amdahl's law.
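The effect of Amdahl's law on this trade-off can be illustrated with a minimal sketch. The two machines, their speedup factors, and the 70% vectorizable fraction below are hypothetical values chosen only for illustration; they do not describe any real processor.

```python
def overall_speedup(vector_fraction, vector_speedup, scalar_speedup=1.0):
    """Amdahl's law with separate speedups for the scalar and vector parts.

    vector_fraction: share of baseline run time that is vectorizable (0..1).
    vector_speedup:  speedup applied to the vectorizable share.
    scalar_speedup:  speedup applied to the remaining scalar share.
    """
    scalar_fraction = 1.0 - vector_fraction
    new_time = (scalar_fraction / scalar_speedup
                + vector_fraction / vector_speedup)
    return 1.0 / new_time


# Hypothetical workload: 70% of baseline time is vectorizable.
VECTOR_FRACTION = 0.7

# Hypothetical machine A: strong vector unit, baseline scalar unit.
machine_a = overall_speedup(VECTOR_FRACTION, vector_speedup=10.0,
                            scalar_speedup=1.0)

# Hypothetical machine B: half the vector speedup, twice the scalar speed.
machine_b = overall_speedup(VECTOR_FRACTION, vector_speedup=5.0,
                            scalar_speedup=2.0)

print(f"machine A overall speedup: {machine_a:.2f}")
print(f"machine B overall speedup: {machine_b:.2f}")
```

Under these assumed numbers, machine B comes out ahead despite its lower vector speedup, because the scalar share dominates the run time once the vectorizable share has been accelerated.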
A good example of this comes from comparing a fast scalar processor and a vector processor
with lower scalar performance. The Livermore FORTRAN kernels are a collection of 24 sci-
entific kernels with varying degrees of vectorization. Figure 4.31 shows the performance of
two different processors on this benchmark. Despite the vector processor's higher peak per-
formance, its low scalar performance makes it slower than a fast scalar processor as measured
by the harmonic mean.
FIGURE 4.31 Performance measurements for the Livermore FORTRAN kernels on two
different processors. Both the MIPS M/120-5 and the Stardent-1500 (formerly the Ardent
Titan-1) use a 16.7 MHz MIPS R2000 chip for the main CPU. The Stardent-1500 uses its vector
unit for scalar FP and has about half the scalar performance (as measured by the minimum
rate) of the MIPS M/120-5, which uses the MIPS R2010 FP chip. The vector processor is
more than 2.5× faster for a highly vectorizable loop (maximum rate). However, the
lower scalar performance of the Stardent-1500 negates the higher vector performance when
total performance is measured by the harmonic mean on all 24 loops.
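A short sketch may help show why the harmonic mean of rates favors the machine with better scalar performance. The per-loop rates below are hypothetical and are not the measurements from Figure 4.31; they only mimic the pattern described in the caption (half the minimum rate, 2.5× the maximum rate).

```python
def harmonic_mean(rates):
    """Harmonic mean of per-loop rates (e.g., MFLOPS).

    Equivalent to total work divided by total time when every loop
    performs the same amount of work, so slow loops dominate.
    """
    return len(rates) / sum(1.0 / r for r in rates)


def arithmetic_mean(rates):
    """Plain average of rates, shown only for contrast."""
    return sum(rates) / len(rates)


# Hypothetical rates for 24 loops: 16 poorly vectorizable, 8 highly vectorizable.
scalar_machine = [1.0] * 16 + [4.0] * 8    # better scalar rate, lower peak
vector_machine = [0.5] * 16 + [10.0] * 8   # half the scalar rate, 2.5x the peak

for name, rates in [("scalar machine", scalar_machine),
                    ("vector machine", vector_machine)]:
    print(f"{name}: arithmetic mean = {arithmetic_mean(rates):.2f}, "
          f"harmonic mean = {harmonic_mean(rates):.2f}")
```

With these assumed rates, the arithmetic mean ranks the vector machine first, while the harmonic mean ranks the scalar machine first, since the time spent in the slow loops outweighs the gain on the fast ones.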
The flip of this danger today is increasing vector performance—say, by increasing the num-
ber of lanes—without increasing scalar performance. Such myopia is another path to an un-
balanced computer.
The next fallacy is closely related.
Fallacy You Can Get Good Vector Performance Without Providing Memory
Bandwidth
As we saw with the DAXPY loop and the Roofline model, memory bandwidth is quite import-
ant to all SIMD architectures. DAXPY requires 1.5 memory references per floating-point op-
eration, and this ratio is typical of many scientific codes. Even if the floating-point operations
took no time, a Cray-1 could not increase the performance of the vector sequence used, since
it is memory limited. The Cray-1 performance on Linpack jumped when the compiler used
 