Cache benefits. Ray casting (RC) is only 1.6× faster on the GTX 280 because cache blocking with
the Core i7 caches prevents it from becoming memory bandwidth bound, as it is on GPUs.
Cache blocking can help Search, too. If the index trees are small so that they fit in the cache,
the Core i7 is twice as fast. Larger index trees make them memory bandwidth bound.
Overall, the GTX 280 runs search 1.8× faster. Cache blocking also helps Sort. While most
programmers wouldn't run Sort on a SIMD processor, it can be written with a 1-bit Sort
primitive called split. However, the split algorithm executes many more instructions than
a scalar sort does. As a result, the GTX 280 runs only 0.8× as fast as the Core i7. Note that
caches also help other kernels on the Core i7, since cache blocking allows SGEMM, FFT,
and SpMV to become compute bound. This observation re-emphasizes the importance of
cache blocking optimizations in Chapter 2. (It would be interesting to see how caches of
the Fermi GTX 480 will affect the six kernels mentioned in this paragraph.)
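To make the split-based Sort mentioned above more concrete, here is a minimal sketch of a radix sort built from a 1-bit split primitive. It is an illustration under stated assumptions, not the code used in the study: the function names `split` and `split_radix_sort`, the `num_bits` parameter, and the restriction to non-negative integer keys are all hypothetical. Each pass is written sequentially here, but its three steps (compute flags, prefix-sum the flags, scatter to destinations) are the parts that map onto data-parallel hardware.

```python
def split(keys, bit):
    """Stable 1-bit split: keys whose `bit` is 0 come first, then keys whose `bit` is 1."""
    flags = [(k >> bit) & 1 for k in keys]

    # Exclusive prefix sum over the "bit is 0" flags: for each position,
    # how many zero-keys appear strictly before it.
    zeros_before = []
    count = 0
    for f in flags:
        zeros_before.append(count)
        count += 1 - f
    total_zeros = count

    # Scatter every key to its destination slot.
    out = [0] * len(keys)
    for i, (k, f) in enumerate(zip(keys, flags)):
        if f == 0:
            dest = zeros_before[i]
        else:
            # One-keys are placed after all zero-keys, in their original order.
            ones_before = i - zeros_before[i]
            dest = total_zeros + ones_before
        out[dest] = k
    return out


def split_radix_sort(keys, num_bits=32):
    """Sort non-negative integers by applying split from the least to the most significant bit."""
    for bit in range(num_bits):
        keys = split(keys, bit)
    return keys
```

Note that the sort makes one full pass over the data per key bit, and each pass touches every key several times. That is one way to see why a split-based sort can execute many more instructions than a scalar sort, even though each pass parallelizes well.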
Gather-Scatter. The multimedia SIMD extensions are of little help if the data are scattered
throughout main memory; optimal performance comes only when data are aligned on
16-byte boundaries. Thus, GJK gets little benefit from SIMD on the Core i7. As mentioned
above, GPUs offer gather-scatter addressing that is found in a vector architecture but omitted
from SIMD extensions. The address coalescing unit helps as well by combining accesses
to the same DRAM line, thereby reducing the number of gathers and scatters. The
memory controller also batches accesses to the same DRAM page together. This combination
means the GTX 280 runs GJK a startling 15.2× faster than the Core i7, which is larger
than any single physical parameter in Figure 4.27. This observation reinforces the importance
of gather-scatter to vector and GPU architectures that is missing from SIMD extensions.
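The following sketch illustrates the two ideas in this paragraph: indexed (gather-scatter) access, and the way an address coalescing step can merge accesses that fall in the same memory line. It is a simplified model, not a description of the GTX 280 hardware. The function names and the 64-byte `line_bytes` default are assumptions chosen for illustration only.

```python
def gather(memory, indices):
    """Indexed load: element i of the result is memory[indices[i]]."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Indexed store: memory[indices[i]] = values[i] for every i."""
    for i, v in zip(indices, values):
        memory[i] = v


def coalesce(addresses, line_bytes=64):
    """Group per-lane byte addresses by the memory line they fall in.

    Returns a dict mapping each line's base address to the list of lanes
    that a single access to that line would serve. `line_bytes` is an
    illustrative assumption, not a measured hardware parameter.
    """
    lines = {}
    for lane, addr in enumerate(addresses):
        base = (addr // line_bytes) * line_bytes
        lines.setdefault(base, []).append(lane)
    return lines
```

For example, eight lanes reading consecutive 4-byte elements starting at address 0 fall in a single 64-byte line, so `coalesce` returns one entry. Eight lanes reading the addresses `[0, 4, 64, 68, 128, 4096, 4100, 8]` fall in four lines, so the model issues four accesses instead of eight.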
Synchronization. The performance of synchronization is limited by atomic updates, which
are responsible for 28% of the total runtime on the Core i7 despite its having a hardware
fetch-and-increment instruction. Thus, Hist is only 1.7× faster on the GTX 280. As mentioned
above, the atomic updates of the Fermi GTX 480 are 5 to 20× faster than those of the
Tesla GTX 280, so once again it would be interesting to run Hist on the newer GPU. Solv
solves a batch of independent constraints in a small amount of computation followed by
barrier synchronization. The Core i7 benefits from the atomic instructions and a memory
consistency model that ensures the right results even if not all previous accesses to the memory
hierarchy have completed. Without the memory consistency model, the GTX 280 version
launches some batches from the system processor, which leads to the GTX 280 running 0.5×
as fast as the Core i7. This observation points out how synchronization performance can be
important for some data parallel problems.
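As a rough illustration of why atomic updates dominate a kernel like Hist, the sketch below builds a histogram in which several threads increment shared bins. It is a minimal model under stated assumptions, not the benchmark code: the function name `histogram_shared`, the `x % num_bins` binning rule, and the thread count are hypothetical, and a `threading.Lock` stands in for a hardware atomic fetch-and-increment.

```python
import threading


def histogram_shared(data, num_bins, num_threads=4):
    """Histogram of non-negative integers using one shared array of bins.

    Every increment goes through a lock, which stands in for an atomic
    update: the threads serialize on this step however much parallelism
    the rest of the loop offers.
    """
    bins = [0] * num_bins
    lock = threading.Lock()

    def worker(chunk):
        for x in chunk:
            with lock:  # emulates an atomic increment on a shared counter
                bins[x % num_bins] += 1

    # Give each thread a strided slice of the input.
    chunks = [data[t::num_threads] for t in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return bins
```

Almost all the work per element is the protected increment itself, so the cost of that one operation sets the speed of the whole kernel. This matches the observation above that faster atomic updates would be expected to help Hist directly.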
It is striking how often weaknesses in the Tesla GTX 280 that were uncovered by kernels
selected by Intel researchers were already being addressed in the successor architecture to
Tesla: Fermi has faster double-precision floating-point performance, atomic operations, and
caches. (In a related study, IBM researchers made the same observation [Bordawekar 2010].)
It was also interesting that the gather-scatter support of vector architectures that predate the
SIMD instructions by decades was so important to the effective usefulness of these SIMD extensions,
which some had predicted before the comparison [Gebis and Patterson 2007]. The Intel
researchers noted that 6 of the 14 kernels would exploit SIMD better with more efficient
gather-scatter support on the Core i7. This study certainly establishes the importance of cache
blocking as well. It will be interesting to see if future generations of the multicore and GPU
hardware, compilers, and libraries respond with features that improve performance on such
kernels.