Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach


As we shall see in Section 4.4, all loads are gathers and all stores are scatters in GPUs. To
avoid running slowly in the frequent case of unit strides, it is up to the GPU programmer to
ensure that all the addresses in a gather or scatter are to adjacent locations. In addition, the
GPU hardware must recognize the sequence of these addresses during execution to turn the
gathers and scatters into the more efficient unit stride accesses to memory.
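As a rough illustration of the distinction the hardware has to detect, the C sketch below contrasts a general gather with the unit-stride special case. The function names `gather` and `is_unit_stride` are hypothetical, and the check only models in software the kind of address-sequence recognition described above; it does not describe how any particular GPU implements it.

```c
#include <stdbool.h>
#include <stddef.h>

/* General gather: each element may come from an arbitrary location. */
void gather(double *dst, const double *src, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}

/* Returns true when the indices name adjacent locations
   (idx[i] == idx[0] + i), i.e. the gather is really a unit-stride load.
   Hypothetical helper, for illustration only. */
bool is_unit_stride(const size_t *idx, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (idx[i] != idx[0] + i)
            return false;
    return true;
}
```

With indices such as {8, 9, 10, 11} the check succeeds and the access could be served as one contiguous block; with {8, 3, 10, 0} it fails and each element has to be fetched separately.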

Programming Vector Architectures

An advantage of vector architectures is that compilers can tell programmers at compile time
whether a section of code will vectorize or not, often giving hints as to why it did not vectorize
the code. This straightforward execution model allows experts in other domains to learn how
to improve performance by revising their code or by giving hints to the compiler when it's OK
to assume independence between operations, such as for gather-scatter data transfers. It is this
dialog between the compiler and the programmer, with each side giving hints to the other on
how to improve performance, that simplifies programming of vector computers.
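One way such a programmer-to-compiler hint can look in present-day C is sketched below. This is only an illustration under stated assumptions, not the mechanism of any specific vector compiler discussed here: the function name `scatter_add` is hypothetical, the C99 `restrict` qualifier promises that the arrays do not overlap, and the OpenMP `simd` directive asserts that the iterations are independent. The compiler cannot verify that the index vector has no repeated entries, so that promise rests entirely with the caller.

```c
#include <stddef.h>

/* Scatter-style update: a[idx[i]] += b[i].
   ASSUMPTION (made by the caller, not checked here): idx holds no
   repeated values and a, b, idx do not overlap, so every iteration
   is independent of the others. */
void scatter_add(double *restrict a, const double *restrict b,
                 const size_t *restrict idx, size_t n)
{
    /* OpenMP 4.0 hint; compilers built without OpenMP support ignore it. */
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        a[idx[i]] += b[i];
}
```

If `idx` did contain duplicates, the hint would be wrong and a vectorized version could produce different results from the scalar loop, which is why the compiler will not assume independence on its own.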

Today, the main factor that affects the success with which a program runs in vector mode is
the structure of the program itself: Do the loops have true data dependences (see Section 4.5),
or can they be restructured so as not to have such dependences? This factor is influenced by
the algorithms chosen and, to some extent, by how they are coded.
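A small, hypothetical example of such restructuring is sketched below in C. The first loop carries a true dependence from one iteration to the next; the second computes the same values from the loop index alone. The function names are invented for illustration, and integer arithmetic is used so that both versions give identical results (with floating-point data the two forms could differ by rounding).

```c
#include <stddef.h>

/* True (loop-carried) dependence: iteration i reads the value that
   iteration i-1 just wrote, so the iterations cannot run side by side. */
void ramp_recurrence(long *x, size_t n, long start, long step)
{
    if (n == 0)
        return;
    x[0] = start;
    for (size_t i = 1; i < n; i++)
        x[i] = x[i - 1] + step;
}

/* Restructured: each element is computed from i alone, so no iteration
   depends on another and the loop is a candidate for vectorization. */
void ramp_closed_form(long *x, size_t n, long start, long step)
{
    for (size_t i = 0; i < n; i++)
        x[i] = start + (long)i * step;
}
```

Not every recurrence has a closed form like this one, which is why the choice of algorithm matters as much as the way it is coded.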

As an indication of the level of vectorization achievable in scientific programs, let's look at
the vectorization levels observed for the Perfect Club benchmarks. Figure 4.7 shows the
percentage of operations executed in vector mode for two versions of the code running on the
Cray Y-MP. The first version is that obtained with just compiler optimization on the original
code, while the second version uses extensive hints from a team of Cray Research programmers.
Several studies of the performance of applications on vector processors show a wide
variation in the level of compiler vectorization.
