Hardware Reference
In-Depth Information
Although the system processor that is associated with a GPU is the closest analogy to a scalar
processor in a vector architecture, the separate address spaces plus transferring over a PCIe bus
means thousands of clock cycles of overhead to use them together. The scalar processor can be
slower than a vector processor for floating-point computations in a vector computer, but not by
the same ratio as the system processor versus a multithreaded SIMD Processor (given the
overhead).
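The trade-off can be sketched with a back-of-the-envelope cost model. This is only an illustration: the function names are hypothetical, and the figures used below (5,000 cycles of transfer overhead, a single SIMD Lane that is 8 times slower per operation than the system processor) are assumptions chosen to match the "thousands of clock cycles" order of magnitude, not measurements.

```python
import math


def offload_cycles(scalar_ops, transfer_overhead_cycles, cpu_cycles_per_op):
    """Cycles to run the scalar work on the system processor,
    including the assumed round-trip cost over the bus."""
    return transfer_overhead_cycles + scalar_ops * cpu_cycles_per_op


def single_lane_cycles(scalar_ops, lane_cycles_per_op):
    """Cycles to run the same scalar work on one SIMD Lane of the GPU."""
    return scalar_ops * lane_cycles_per_op


def break_even_ops(transfer_overhead_cycles, cpu_cycles_per_op, lane_cycles_per_op):
    """Number of scalar operations above which offloading becomes cheaper.

    Returns infinity when the lane is at least as fast per operation,
    because offloading can then never pay back the transfer overhead.
    """
    if lane_cycles_per_op <= cpu_cycles_per_op:
        return math.inf
    return transfer_overhead_cycles / (lane_cycles_per_op - cpu_cycles_per_op)


# Assumed figures, for illustration only.
OVERHEAD = 5000      # cycles for the transfer round trip
CPU_COST = 1.0       # cycles per scalar op on the system processor
LANE_COST = 8.0      # cycles per scalar op on a single SIMD Lane

on_cpu = offload_cycles(100, OVERHEAD, CPU_COST)
on_lane = single_lane_cycles(100, LANE_COST)
threshold = break_even_ops(OVERHEAD, CPU_COST, LANE_COST)
```

Under these assumed numbers, 100 scalar operations cost fewer cycles on the slow single lane than on the system processor, and offloading only wins once the scalar section exceeds roughly 700 operations.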
Hence, each “vector unit” in a GPU must do computations that you would expect to do on
a scalar processor in a vector computer. That is, rather than calculate on the system processor
and communicate the results, it can be faster to disable all but one SIMD Lane using the
predicate registers and built-in masks and do the scalar work with one SIMD Lane. The relatively
simple scalar processor in a vector computer is likely to be faster and more power efficient
than the GPU solution. If system processors and GPUs become more closely tied together in
the future, it will be interesting to see if system processors can play the same role as scalar
processors do for vector and Multimedia SIMD architectures.
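The single-lane technique can be illustrated with a small software simulation of predicated execution. This is a sketch under stated assumptions, not how any particular GPU implements masking: the helper simd_execute, the 16-lane width, and the two example operations are all hypothetical.

```python
LANES = 16  # assumed number of SIMD Lanes, for illustration only


def simd_execute(op, registers, mask):
    """Apply op to every lane whose predicate bit is set.

    Lanes that are masked off keep their previous value, which is how
    a predicated SIMD instruction behaves from the program's viewpoint.
    """
    return [op(value) if active else value
            for value, active in zip(registers, mask)]


lane_values = list(range(LANES))

# Vector phase: all lanes are enabled and do useful work.
all_lanes = [True] * LANES
vector_result = simd_execute(lambda x: x * 2, lane_values, all_lanes)

# Scalar phase: the mask disables every lane except lane 0.
only_lane0 = [lane == 0 for lane in range(LANES)]
scalar_result = simd_execute(lambda x: x + 100, vector_result, only_lane0)

# Fraction of lanes doing useful work during the scalar phase.
scalar_utilization = sum(only_lane0) / LANES
```

In the scalar phase only one of the 16 simulated lanes does useful work, which suggests why a dedicated scalar processor is likely to be faster and more power efficient for this kind of code.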
Similarities And Differences Between Multimedia SIMD Computers And GPUs
At a high level, multicore computers with Multimedia SIMD instruction extensions do share
similarities with GPUs. Figure 4.23 summarizes the similarities and differences.
FIGURE 4.23 Similarities and differences between multicore with Multimedia SIMD
extensions and recent GPUs.
Both are multiprocessors whose processors use multiple SIMD lanes, although GPUs have
more processors and many more lanes. Both use hardware multithreading to improve
processor utilization, although GPUs have hardware support for many more threads. Recent
innovations in GPUs mean that now both have similar performance ratios between
single-precision and double-precision floating-point arithmetic. Both use caches, although GPUs use
smaller streaming caches and multicore computers use large multilevel caches that try to contain
whole working sets completely. Both use a 64-bit address space, although the physical main
 