Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach


■ Cache benefits. Ray casting (RC) is only 1.6× faster on the GTX 280 because cache blocking with

the Core i7 caches prevents it from becoming memory bandwidth bound, as it is on GPUs.

Cache blocking can help Search, too. If the index trees are small so that they fit in the cache,

the Core i7 is twice as fast. Larger index trees make them memory bandwidth bound.

Overall, the GTX 280 runs Search 1.8× faster. Cache blocking also helps Sort. While most

programmers wouldn't run Sort on a SIMD processor, it can be written with a 1-bit Sort

primitive called split. However, the split algorithm executes many more instructions than

a scalar sort does. As a result, the GTX 280 runs only 0.8× as fast as the Core i7. Note that

caches also help other kernels on the Core i7, since cache blocking allows SGEMM, FFT,

and SpMV to become compute bound. This observation re-emphasizes the importance of

cache blocking optimizations in Chapter 2. (It would be interesting to see how caches of

the Fermi GTX 480 will affect the six kernels mentioned in this paragraph.)
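
The split-based Sort mentioned above can be sketched as a radix sort that applies one stable 1-bit partition per key bit. This is only an illustration of how such a primitive is commonly built (a prefix sum over the bit flags decides each key's destination), not the code used in the study. The names `split` and `radix_sort`, the `key_bits` parameter, and the restriction to non-negative integer keys are assumptions made for the example. It runs sequentially here, whereas a data-parallel version would compute all destinations at once.

```python
from itertools import accumulate


def split(keys, bit):
    """Stable 1-bit partition: keys with `bit` clear come first, then keys with it set."""
    flags = [(k >> bit) & 1 for k in keys]
    zeros = [1 - f for f in flags]
    # scan[i] is the number of zero-flag keys before position i (exclusive prefix sum);
    # the final entry is the total number of zero-flag keys.
    scan = list(accumulate(zeros, initial=0))
    total_zeros = scan[-1]
    out = [None] * len(keys)
    for i, k in enumerate(keys):
        if flags[i] == 0:
            out[scan[i]] = k
        else:
            # i - scan[i] is the number of one-flag keys before position i.
            out[total_zeros + (i - scan[i])] = k
    return out


def radix_sort(keys, key_bits=32):
    """Sort non-negative integers below 2**key_bits, least significant bit first."""
    for bit in range(key_bits):
        keys = split(keys, bit)
    return keys
```

Sorting k-bit keys this way takes k full passes over the data, which gives a sense of why the split approach executes many more instructions than a scalar sort.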

■ Gather-Scatter. The multimedia SIMD extensions are of little help if the data are scattered

throughout main memory; optimal performance comes only when data are aligned on

16-byte boundaries. Thus, GJK gets little benefit from SIMD on the Core i7. As mentioned

above, GPUs offer gather-scatter addressing that is found in a vector architecture but omitted

from SIMD extensions. The address coalescing unit helps as well by combining accesses

to the same DRAM line, thereby reducing the number of gathers and scatters. The

memory controller also batches accesses to the same DRAM page together. This combination

means the GTX 280 runs GJK a startling 15.2× faster than the Core i7, which is larger

than any single physical parameter in Figure 4.27. This observation reinforces the importance

of gather-scatter to vector and GPU architectures, a capability that is missing from SIMD extensions.
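
To make the gather-scatter and coalescing ideas above concrete, the following sketch shows indexed loads and stores plus a toy model that counts how many distinct memory lines one group of accesses touches. The function names, the 4-byte element size, and the 64-byte line size are assumptions chosen for illustration. They are not parameters of the GTX 280, and real coalescing rules are more involved.

```python
def gather(memory, indices):
    """Indexed load: result[j] = memory[indices[j]]."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Indexed store: memory[indices[j]] = values[j], done in place."""
    for i, v in zip(indices, values):
        memory[i] = v


def coalesced_transactions(indices, elem_bytes=4, line_bytes=64):
    """Count the distinct memory lines touched by one group of accesses.

    Toy model: accesses that fall in the same line are merged into one transaction.
    """
    lines = {(i * elem_bytes) // line_bytes for i in indices}
    return len(lines)
```

Under these assumptions, 16 accesses to consecutive elements need a single transaction, while 16 accesses with a stride of 16 elements need 16, which is why scattered data benefits so much from hardware that merges accesses.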

■ Synchronization. The performance of synchronization is limited by atomic updates, which

are responsible for 28% of the total runtime on the Core i7 despite its having a hardware

fetch-and-increment instruction. Thus, Hist is only 1.7× faster on the GTX 280. As mentioned

above, the atomic updates of the Fermi GTX 480 are 5 to 20× faster than those of the

Tesla GTX 280, so once again it would be interesting to run Hist on the newer GPU. Solv

solves a batch of independent constraints in a small amount of computation followed by

barrier synchronization. The Core i7 benefits from the atomic instructions and a memory

consistency model that ensures the right results even if not all previous accesses to the memory

hierarchy have completed. Without the memory consistency model, the GTX 280 version

launches some batches from the system processor, which leads to the GTX 280 running 0.5×

as fast as the Core i7. This observation points out how synchronization performance can be

important for some data parallel problems.
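
The two synchronization patterns described above can be sketched as follows: a histogram in which every update to a shared bin is atomic, and a batched solver in which all workers meet at a barrier after each batch. This is a hedged illustration, not the kernels from the study. A `threading.Lock` stands in for a hardware fetch-and-increment, the doubling step is a placeholder for the small per-constraint computation, and all function names and the `num_threads` parameter are hypothetical.

```python
import threading


def histogram_atomic(data, num_bins, num_threads=4):
    """Histogram with one synchronized increment per input element."""
    bins = [0] * num_bins
    lock = threading.Lock()  # stand-in for an atomic fetch-and-increment

    def worker(tid):
        for value in data[tid::num_threads]:
            with lock:  # every update to a shared bin is serialized
                bins[value] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bins


def solve_batches(batches, num_threads=4):
    """Process each batch in parallel, with a barrier between batches."""
    barrier = threading.Barrier(num_threads)
    results = [[None] * len(batch) for batch in batches]

    def worker(tid):
        for b, batch in enumerate(batches):
            for i in range(tid, len(batch), num_threads):
                results[b][i] = batch[i] * 2  # placeholder computation
            barrier.wait()  # no worker starts the next batch early

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In the histogram, the amount of synchronization grows with the input size, and in the solver it grows with the number of batches, so the cost of each atomic update or barrier can dominate when the work between them is small.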

It is striking how often weaknesses in the Tesla GTX 280 that were uncovered by kernels

selected by Intel researchers were already being addressed in the successor architecture to

Tesla: Fermi has faster double-precision floating-point performance, atomic operations, and

caches. (In a related study, IBM researchers made the same observation [Bordawekar 2010].)

It was also interesting that the gather-scatter support of vector architectures that predate the

SIMD instructions by decades was so important to the effective usefulness of these SIMD extensions,

which some had predicted before the comparison [Gebis and Patterson 2007]. The Intel

researchers noted that 6 of the 14 kernels would exploit SIMD better with more efficient

gather-scatter support on the Core i7. This study certainly establishes the importance of cache

blocking as well. It will be interesting to see if future generations of the multicore and GPU

hardware, compilers, and libraries respond with features that improve performance on such

kernels.
