We hope that there will be more such multicore-GPU comparisons. Note that one important
feature missing from this comparison was a description of the level of effort needed to get the results for
the two systems. Ideally, future comparisons would release the code used on both systems so
that others could recreate the same experiments on different hardware platforms and possibly
improve on the results.
4.8 Fallacies and Pitfalls
While data-level parallelism is the easiest form of parallelism after ILP from the programmer's
perspective, and plausibly the easiest from the architect's perspective, it still has many fallacies
and pitfalls.
Fallacy GPUs Suffer From Being Coprocessors
While the split between main memory and GPU memory has disadvantages, there are advantages to being at a distance from the CPU. For example, PTX exists in part because of the I/O device nature of GPUs. This level of indirection between the compiler and the hardware gives GPU architects much more flexibility than system processor architects. It's often hard to know in advance whether an architecture innovation will be well supported by compilers and libraries and be important to applications. Sometimes a new mechanism will even prove useful for one or two generations and then fade in importance as the IT world changes. PTX allows GPU architects to try innovations speculatively and drop them in subsequent generations if they disappoint or fade in importance, which encourages experimentation. The justification for inclusion is understandably much higher for system processors (and hence much less experimentation can occur), as distributing binary machine code normally implies that new features must be supported by all future generations of that architecture.
A demonstration of the value of PTX is that the Fermi architecture radically changed the
hardware instruction set—from being memory-oriented like x86 to being register-oriented like
MIPS as well as doubling the address size to 64 bits—without disrupting the NVIDIA software
stack.
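The indirection works roughly as follows: the compiler emits architecture-neutral PTX, and the GPU driver translates that PTX into the hardware ISA of whatever GPU is actually installed when the module is loaded. The sketch below illustrates that loading path with the CUDA driver API; the file name daxpy.ptx and the kernel name daxpy are hypothetical, and device memory setup and the kernel launch are elided, so treat it as a minimal sketch rather than a complete program.

/* Minimal sketch of the PTX indirection layer, using the CUDA driver API.
   "daxpy.ptx" and the kernel name "daxpy" are hypothetical; the kernel is
   assumed to be declared extern "C" so its PTX name is unmangled. */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call)                                                        \
    do {                                                                   \
        CUresult err_ = (call);                                            \
        if (err_ != CUDA_SUCCESS) {                                        \
            fprintf(stderr, "CUDA driver error %d\n", (int)err_);          \
            exit(1);                                                       \
        }                                                                  \
    } while (0)

int main(void) {
    /* Read architecture-neutral PTX produced earlier by, e.g.,
       "nvcc -ptx daxpy.cu -o daxpy.ptx". */
    FILE *f = fopen("daxpy.ptx", "rb");
    if (!f) { perror("daxpy.ptx"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    char *ptx = malloc(size + 1);
    fread(ptx, 1, size, f);
    ptx[size] = '\0';
    fclose(f);

    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction kernel;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    /* The driver JIT-compiles the same PTX to the hardware ISA of whatever
       GPU generation is installed, which is why a hardware ISA change (as in
       Fermi) does not invalidate previously shipped PTX. */
    CHECK(cuModuleLoadData(&mod, ptx));
    CHECK(cuModuleGetFunction(&kernel, mod, "daxpy"));

    /* ... allocate device memory, set arguments, and launch with
       cuLaunchKernel ... */

    CHECK(cuModuleUnload(mod));
    CHECK(cuCtxDestroy(ctx));
    free(ptx);
    return 0;
}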
Pitfall Concentrating On Peak Performance In Vector Architectures And Ignoring
Start-up Overhead
Early memory-memory vector processors such as the TI ASC and the CDC STAR-100 had long
start-up times. For some vector problems, vectors had to be longer than 100 for the vector code
to be faster than the scalar code! On the CYBER 205—derived from the STAR-100—the start-up
overhead for DAXPY is 158 clock cycles, which substantially increases the break-even point. If
the clock rates of the Cray-1 and the CYBER 205 were identical, the Cray-1 would be faster until the vector length exceeded 64. Because the Cray-1 clock was also faster (even though the 205 was newer), the crossover point was a vector length over 100.
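To make the break-even point concrete: in a simple model, vector execution costs a fixed start-up overhead plus some cycles per element, while scalar execution costs only cycles per element, so the vector code wins once n > startup / (scalar_per_element - vector_per_element). The sketch below uses the 158-cycle DAXPY start-up figure from the text; the per-element cycle counts are illustrative assumptions, not measured CYBER 205 or Cray-1 numbers.

/* Break-even vector length under a simple timing model. Only the 158-cycle
   DAXPY start-up overhead comes from the text; the per-element cycle counts
   below are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    double startup_cycles  = 158.0; /* vector start-up overhead (from the text)       */
    double vector_per_elem = 2.0;   /* assumed clocks per element for the vector code */
    double scalar_per_elem = 4.0;   /* assumed clocks per element for the scalar code */

    /* Vector time = startup + n * vector_per_elem; scalar time = n * scalar_per_elem.
       The vector code wins once n > startup / (scalar_per_elem - vector_per_elem). */
    double breakeven = startup_cycles / (scalar_per_elem - vector_per_elem);
    printf("vector code faster for n > %.0f elements\n", breakeven);
    return 0;
}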