We hope that there will be more such multicore-GPU comparisons. Note that one important
feature missing from this comparison was a description of the level of effort needed to get the results for
the two systems. Ideally, future comparisons would release the code used on both systems so
that others could recreate the same experiments on different hardware platforms and possibly
improve on the results.
4.8 Fallacies and Pitfalls
While data-level parallelism is the easiest form of parallelism after ILP from the programmer's
perspective, and plausibly the easiest from the architect's perspective, it still has many fallacies
and pitfalls.
Fallacy GPUs Suffer From Being Coprocessors
While the split between main memory and GPU memory has disadvantages, there are advantages to being at a distance from the CPU. For example, PTX exists in part because of the I/O device nature of GPUs. This level of indirection between the compiler and the hardware gives GPU architects much more flexibility than system processor architects. It's often hard to know in advance whether an architecture innovation will be well supported by compilers and libraries and be important to applications. Sometimes a new mechanism will even prove useful for one or two generations and then fade in importance as the IT world changes. PTX allows GPU architects to try innovations speculatively and drop them in subsequent generations if they disappoint or fade in importance, which encourages experimentation. The justification for inclusion is understandably much higher for system processors (and hence much less experimentation can occur), as distributing binary machine code normally implies that new features must be supported by all future generations of that architecture.
A demonstration of the value of PTX is that the Fermi architecture radically changed the
hardware instruction set—from being memory-oriented like x86 to being register-oriented like
MIPS as well as doubling the address size to 64 bits—without disrupting the NVIDIA software
stack.
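The indirection works roughly as follows: the compiler emits architecture-neutral PTX, and the GPU driver translates that PTX into the hardware ISA of whatever GPU is actually installed when the module is loaded. The sketch below illustrates that loading path with the CUDA driver API; the file name daxpy.ptx and the kernel name daxpy are hypothetical, and device memory setup and the kernel launch are elided, so treat it as a minimal sketch rather than a complete program.

/* Minimal sketch of the PTX indirection layer, using the CUDA driver API.
   "daxpy.ptx" and the kernel name "daxpy" are hypothetical; the kernel is
   assumed to be declared extern "C" so its PTX name is unmangled. */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call)                                                        \
    do {                                                                   \
        CUresult err_ = (call);                                            \
        if (err_ != CUDA_SUCCESS) {                                        \
            fprintf(stderr, "CUDA driver error %d\n", (int)err_);          \
            exit(1);                                                       \
        }                                                                  \
    } while (0)

int main(void) {
    /* Read architecture-neutral PTX produced earlier by, e.g.,
       "nvcc -ptx daxpy.cu -o daxpy.ptx". */
    FILE *f = fopen("daxpy.ptx", "rb");
    if (!f) { perror("daxpy.ptx"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    char *ptx = malloc(size + 1);
    fread(ptx, 1, size, f);
    ptx[size] = '\0';
    fclose(f);

    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction kernel;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    /* The driver JIT-compiles the same PTX to the hardware ISA of whatever
       GPU generation is installed, which is why a hardware ISA change (as in
       Fermi) does not invalidate previously shipped PTX. */
    CHECK(cuModuleLoadData(&mod, ptx));
    CHECK(cuModuleGetFunction(&kernel, mod, "daxpy"));

    /* ... allocate device memory, set arguments, and launch with
       cuLaunchKernel ... */

    CHECK(cuModuleUnload(mod));
    CHECK(cuCtxDestroy(ctx));
    free(ptx);
    return 0;
}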
Pitfall Concentrating On Peak Performance In Vector Architectures And Ignoring
Start-up Overhead
Early memory-memory vector processors such as the TI ASC and the CDC STAR-100 had long
start-up times. For some vector problems, vectors had to be longer than 100 for the vector code
to be faster than the scalar code! On the CYBER 205—derived from the STAR-100—the start-up
overhead for DAXPY is 158 clock cycles, which substantially increases the break-even point. If
the clock rates of the Cray-1 and the CYBER 205 were identical, the Cray-1 would be faster until the vector length exceeded 64. Because the Cray-1 clock was also faster (even though the 205 was newer), the crossover point was a vector length over 100.
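To make the break-even point concrete: in a simple model, vector execution costs a fixed start-up overhead plus some cycles per element, while scalar execution costs only cycles per element, so the vector code wins once n > startup / (scalar_per_element - vector_per_element). The sketch below uses the 158-cycle DAXPY start-up figure from the text; the per-element cycle counts are illustrative assumptions, not measured CYBER 205 or Cray-1 numbers.

/* Break-even vector length under a simple timing model. Only the 158-cycle
   DAXPY start-up overhead comes from the text; the per-element cycle counts
   below are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    double startup_cycles  = 158.0; /* vector start-up overhead (from the text)       */
    double vector_per_elem = 2.0;   /* assumed clocks per element for the vector code */
    double scalar_per_elem = 4.0;   /* assumed clocks per element for the scalar code */

    /* Vector time = startup + n * vector_per_elem; scalar time = n * scalar_per_elem.
       The vector code wins once n > startup / (scalar_per_elem - vector_per_elem). */
    double breakeven = startup_cycles / (scalar_per_elem - vector_per_elem);
    printf("vector code faster for n > %.0f elements\n", breakeven);
    return 0;
}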