Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach


FIGURE 4.20 Block diagram of the multithreaded SIMD Processor of a Fermi GPU. Each SIMD Lane has a pipelined floating-point unit, a pipelined integer unit, some logic for dispatching instructions and operands to these units, and a queue for holding results. The four Special Function units (SFUs) calculate functions such as square roots, reciprocals, sines, and cosines.

Fermi introduces several innovations to bring GPUs much closer to mainstream system processors than Tesla and previous generations of GPU architectures:

■ Fast Double-Precision Floating-Point Arithmetic: Fermi matches the relative double-precision speed of conventional processors, running double precision at roughly half the speed of single precision, versus a tenth the speed of single precision in the prior Tesla generation. That is, there is no order-of-magnitude temptation to use single precision when the accuracy calls for double precision. The peak double-precision performance grew from 78 GFLOP/sec in the predecessor GPU to 515 GFLOP/sec when using multiply-add instructions.
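
The peak figures quoted above can be checked with a little arithmetic. The sketch below is illustrative only: the 78 and 515 GFLOP/sec values come from the text, while the lane count (448) and clock rate (1.15 GHz) are assumptions chosen to match one commonly cited Fermi-based product and are not stated in this section. The helper name `peak_gflops` is hypothetical. A multiply-add is counted as two floating-point operations, which is why the peak is quoted "when using multiply-add instructions."

```python
def peak_gflops(fp_units, clock_ghz, flops_per_unit_per_cycle=2):
    """Peak throughput in GFLOP/sec.

    A multiply-add instruction counts as 2 floating-point operations,
    hence the default of 2 flops per unit per cycle.
    """
    return fp_units * clock_ghz * flops_per_unit_per_cycle


# Figures quoted in the text.
TESLA_DP_GFLOPS = 78.0
FERMI_DP_GFLOPS = 515.0
speedup = FERMI_DP_GFLOPS / TESLA_DP_GFLOPS  # generation-over-generation gain

# ASSUMED configuration (not from the text): 448 single-precision lanes
# at 1.15 GHz, with double precision issuing at half the single rate.
ASSUMED_SP_LANES = 448
ASSUMED_CLOCK_GHZ = 1.15

sp_peak = peak_gflops(ASSUMED_SP_LANES, ASSUMED_CLOCK_GHZ)
dp_peak = peak_gflops(ASSUMED_SP_LANES // 2, ASSUMED_CLOCK_GHZ)

print(f"DP speedup over predecessor: {speedup:.1f}x")
print(f"Assumed SP peak: {sp_peak:.1f} GFLOP/sec")
print(f"Assumed DP peak: {dp_peak:.1f} GFLOP/sec")
print(f"DP/SP ratio: {dp_peak / sp_peak:.2f}")
```

Under these assumed parameters the double-precision peak lands close to the 515 GFLOP/sec quoted in the text, and the double-to-single ratio is one half, consistent with the bullet's claim.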
