Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach

Pitfall: Increasing Vector Performance Without Comparable Increases in Scalar Performance

This imbalance was a problem on many early vector processors, and a place where Seymour Cray (the architect of the Cray computers) rewrote the rules. Many of the early vector processors had comparatively slow scalar units (as well as large start-up overheads). Even today, a processor with lower vector performance but better scalar performance can outperform a processor with higher peak vector performance. Good scalar performance keeps down overhead costs (strip mining, for example) and reduces the impact of Amdahl's law.
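The Amdahl's law argument can be made concrete with a small sketch. The two machines and their rates below are hypothetical, chosen only for illustration: one has the faster scalar unit, the other the faster vector unit. The function `run_time` and the names `balanced` and `vector_heavy` are assumptions of this example, not taken from the text.

```python
def run_time(vector_fraction, scalar_rate, vector_rate):
    """Time for one unit of work under Amdahl's law.

    `vector_fraction` of the work runs at `vector_rate`; the remainder
    runs at `scalar_rate`. Rates are in arbitrary work units per second.
    """
    return (1.0 - vector_fraction) / scalar_rate + vector_fraction / vector_rate


# Hypothetical machines (illustrative numbers, not measurements).
balanced = {"scalar_rate": 2.0, "vector_rate": 8.0}
vector_heavy = {"scalar_rate": 1.0, "vector_rate": 20.0}

for fraction in (0.5, 0.7, 0.9):
    t_balanced = run_time(fraction, **balanced)
    t_vector = run_time(fraction, **vector_heavy)
    winner = "balanced" if t_balanced < t_vector else "vector-heavy"
    print(f"vectorized fraction {fraction:.1f}: "
          f"balanced {t_balanced:.4f}, vector-heavy {t_vector:.4f} -> {winner}")
```

With these assumed rates, the machine with the higher peak vector rate wins only when roughly 87% or more of the work is vectorized; below that, the scalar portion dominates its run time.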

A good example of this comes from comparing a fast scalar processor and a vector processor with lower scalar performance. The Livermore FORTRAN kernels are a collection of 24 scientific kernels with varying degrees of vectorization. Figure 4.31 shows the performance of two different processors on this benchmark. Despite the vector processor's higher peak performance, its low scalar performance makes it slower than a fast scalar processor as measured by the harmonic mean.

FIGURE 4.31 Performance measurements for the Livermore FORTRAN kernels on two different processors. Both the MIPS M/120-5 and the Stardent-1500 (formerly the Ardent Titan-1) use a 16.7 MHz MIPS R2000 chip for the main CPU. The Stardent-1500 uses its vector unit for scalar FP and has about half the scalar performance (as measured by the minimum rate) of the MIPS M/120-5, which uses the MIPS R2010 FP chip. The vector processor is more than 2.5× faster for a highly vectorizable loop (maximum rate). However, the lower scalar performance of the Stardent-1500 negates the higher vector performance when total performance is measured by the harmonic mean on all 24 loops.
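The effect of the harmonic mean can be illustrated with a short sketch. The per-kernel rates below are hypothetical and are not the Figure 4.31 measurements; they are chosen so that the vector machine has about half the minimum rate and 2.5× the maximum rate of the scalar machine. The harmonic mean of rates corresponds to total execution time when every kernel performs the same amount of work, so the slowest kernels weigh most heavily.

```python
from statistics import harmonic_mean

# Hypothetical per-kernel rates in MFLOPS (illustrative, NOT measured data).
scalar_machine = [0.8, 1.0, 1.5, 2.0, 3.0, 4.0]
vector_machine = [0.4, 0.5, 0.8, 1.5, 6.0, 10.0]

for name, rates in (("fast scalar", scalar_machine), ("vector", vector_machine)):
    print(f"{name}: min {min(rates):.2f}, max {max(rates):.2f}, "
          f"harmonic mean {harmonic_mean(rates):.2f}")
```

Under these assumed rates the vector machine has the higher maximum, yet its harmonic mean is lower, because its slow kernels account for most of the total time.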

The flip side of this danger today is increasing vector performance (say, by increasing the number of lanes) without increasing scalar performance. Such myopia is another path to an unbalanced computer.

The next fallacy is closely related.

Fallacy: You Can Get Good Vector Performance Without Providing Memory Bandwidth

As we saw with the DAXPY loop and the Roofline model, memory bandwidth is quite important to all SIMD architectures. DAXPY requires 1.5 memory references per floating-point operation, and this ratio is typical of many scientific codes. Even if the floating-point operations took no time, a Cray-1 could not increase the performance of the vector sequence used, since it is memory limited. The Cray-1 performance on Linpack jumped when the compiler used
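The 1.5 ratio for DAXPY, and the memory limit it implies, can be checked with a minimal sketch. The function names `daxpy` and `attainable_gflops` and the machine parameters (100 GFLOP/s peak, 50 GB/s bandwidth) are assumptions made for illustration; they do not describe the Cray-1 or any other real machine. The bound is the basic Roofline form: the smaller of peak compute and bandwidth times arithmetic intensity.

```python
def daxpy(a, x, y):
    """Compute y = a*x + y in place, counting memory references and flops."""
    mem_refs = 0
    flops = 0
    for i in range(len(x)):
        xi = x[i]            # load x[i]
        yi = y[i]            # load y[i]
        y[i] = a * xi + yi   # one multiply, one add, one store
        mem_refs += 3
        flops += 2
    return mem_refs, flops


def attainable_gflops(peak_gflops, peak_gbytes_per_s, flops_per_byte):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, peak_gbytes_per_s * flops_per_byte)


y = [3.0, 4.0]
mem_refs, flops = daxpy(2.0, [1.0, 2.0], y)
print("memory references per flop:", mem_refs / flops)

# Double precision: 3 references * 8 bytes for every 2 flops.
intensity = 2 / 24
print("bound at 100 GFLOP/s peak:", attainable_gflops(100.0, 50.0, intensity))
print("bound at 200 GFLOP/s peak:", attainable_gflops(200.0, 50.0, intensity))
```

In this sketch, doubling the assumed peak floating-point rate leaves the bound unchanged, since the loop is limited by memory bandwidth rather than by arithmetic.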
