As we shall see in Section 4.4, all loads are gathers and all stores are scatters in GPUs. To
avoid running slowly in the frequent case of unit strides, it is up to the GPU programmer to
ensure that all the addresses in a gather or scatter are to adjacent locations. In addition, the
GPU hardware must recognize the sequence of these addresses during execution to turn the
gathers and scatters into the more efficient unit stride accesses to memory.
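The idea of recognizing adjacent addresses can be illustrated in software. The following is a minimal sketch in C, not a description of actual GPU hardware logic: the function names is_unit_stride and gather are hypothetical, and the block copy merely stands in for the wide unit-stride memory access that hardware would issue.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when idx[0..n-1] are consecutive ascending indices,
   i.e. the gather touches adjacent elements (unit stride). */
bool is_unit_stride(const size_t *idx, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (idx[i] != idx[i - 1] + 1) {
            return false;
        }
    }
    return true;
}

/* Gather: dst[i] = src[idx[i]] for i in [0, n).
   Assumption for illustration: when the indices are adjacent, one block
   copy replaces n separate element accesses. */
void gather(double *dst, const double *src, const size_t *idx, size_t n)
{
    if (n > 0 && is_unit_stride(idx, n)) {
        memcpy(dst, src + idx[0], n * sizeof *dst);  /* unit-stride path */
        return;
    }
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[idx[i]];                        /* general gather */
    }
}
```

Both paths produce the same result; only the access pattern differs, which is why the programmer's choice of indices determines whether the fast path can be taken at all.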
Programming Vector Architectures
An advantage of vector architectures is that compilers can tell programmers at compile time
whether a section of code will vectorize or not, often giving hints as to why it did not vectorize
the code. This straightforward execution model allows experts in other domains to learn how
to improve performance by revising their code or by giving hints to the compiler when it's OK
to assume independence between operations, such as for gather-scatter data transfers. It is this
dialog between the compiler and the programmer, with each side giving hints to the other on
how to improve performance, that simplifies programming of vector computers.
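As one illustration of such a hint, consider a scatter loop written in C. This is a sketch under stated assumptions rather than a recipe for any particular vector compiler: the function name scaled_scatter is hypothetical, the restrict qualifier is standard C99, and #pragma GCC ivdep is specific to GCC, so it is guarded and simply omitted elsewhere. The compiler cannot know whether idx contains repeated entries; the programmer, who does know, asserts that iterations are independent.

```c
#include <stddef.h>

/* Scatter: y[idx[i]] = a * x[i] for i in [0, n).
   restrict tells the compiler that y, x, and idx do not overlap.
   The pragma (GCC only) additionally asserts that there are no
   loop-carried dependences, which is the programmer's promise that
   idx holds no repeated entries. */
void scaled_scatter(double *restrict y, const double *restrict x,
                    const size_t *restrict idx, size_t n, double a)
{
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (size_t i = 0; i < n; i++) {
        y[idx[i]] = a * x[i];
    }
}
```

The hint is only as good as the programmer's knowledge: if idx did contain duplicates, the promise would be false and a vectorized loop might not match the sequential result.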
Today, the main factor that affects the success with which a program runs in vector mode is
the structure of the program itself: Do the loops have true data dependences (see Section 4.5),
or can they be restructured so as not to have such dependences? This factor is influenced by
the algorithms chosen and, to some extent, by how they are coded.
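A small example may make the restructuring question concrete. The sketch below, in C, uses hypothetical function names and assumes that a, c, and d hold n elements, b holds n + 1 elements, and the arrays do not overlap. In the first version, iteration i reads b[i], which iteration i - 1 wrote, so a true dependence is carried from one iteration to the next. The second version computes the same values but shifts the statements so that each loop iteration uses only what it produced itself.

```c
#include <stddef.h>

/* Original form: b[i] read here was written by the previous iteration,
   a loop-carried true (read-after-write) dependence. */
void update_original(double *a, double *b, const double *c,
                     const double *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        b[i + 1] = c[i] + d[i];
    }
}

/* Restructured form: the first and last statements are peeled off, and
   each iteration now writes b[i + 1] before reading it, so no value
   flows between different iterations of the loop. */
void update_restructured(double *a, double *b, const double *c,
                         const double *d, size_t n)
{
    if (n == 0) {
        return;
    }
    a[0] = a[0] + b[0];
    for (size_t i = 0; i + 1 < n; i++) {
        b[i + 1] = c[i] + d[i];
        a[i + 1] = a[i + 1] + b[i + 1];
    }
    b[n] = c[n - 1] + d[n - 1];
}
```

Not every dependence can be removed this way. A recurrence such as a[i] = a[i - 1] + b[i] needs the previous result in every iteration, and only a change of algorithm would help.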
As an indication of the level of vectorization achievable in scientific programs, let's look at
the vectorization levels observed for the Perfect Club benchmarks. Figure 4.7 shows the
percentage of operations executed in vector mode for two versions of the code running on the
Cray Y-MP. The first version is that obtained with just compiler optimization on the original
code, while the second version uses extensive hints from a team of Cray Research
programmers. Several studies of the performance of applications on vector processors show a wide
variation in the level of compiler vectorization.