■ Relatively small size of GPU memory. A commonsense use of faster computation is to solve bigger problems, and bigger problems often have a larger memory footprint. This GPU inconsistency between speed and size can be addressed with more memory capacity. The challenge is to maintain high bandwidth while increasing capacity.
■ Direct I/O to GPU memory. Real programs do I/O to storage devices as well as to frame buffers, and large programs can require a lot of I/O as well as a sizeable memory. Today's GPU systems must transfer between I/O devices and system memory and then between system memory and GPU memory. This extra hop significantly lowers I/O performance in some programs, making GPUs less attractive. Amdahl's law warns us what happens when you neglect one piece of the task while accelerating others; the sketch after this list makes the arithmetic concrete. We expect that future GPUs will make all I/O first-class citizens, just as frame buffer I/O is today.
■ Unified physical memories. An alternative solution to the prior two bullets is to have a single physical memory for the system and GPU, just as some inexpensive GPUs do for PMDs and laptops. The AMD Fusion architecture, announced just as this edition was being finished, is an initial merger between traditional GPUs and traditional CPUs. NVIDIA also announced Project Denver, which combines an ARM scalar processor with NVIDIA GPUs in a single address space. When these systems are shipped, it will be interesting to learn just how tightly integrated they are and the impact of integration on performance and energy of both data-parallel and graphics applications.
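As a back-of-the-envelope illustration of the Amdahl's law point above, the following sketch (with made-up numbers) computes the overall speedup when the GPU-acceleratable compute is sped up but the I/O hop through system memory is not:

#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction f of run time is
   accelerated by a factor s and the remaining (1 - f) is untouched. */
static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    /* Hypothetical numbers: 90% of run time is compute sped up 10x
       by the GPU; the remaining 10% is I/O left unaccelerated. */
    printf("Overall speedup: %.2fx\n", amdahl(0.90, 10.0)); /* ~5.26x */
    return 0;
}

Even with a 10x compute speedup, the neglected 10% of the task caps the overall gain at roughly 5.26x, which is why eliminating the extra I/O hop matters.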
Having covered the many versions of SIMD, the next chapter dives into the realm of MIMD.
4.10 Historical Perspective and References
Section L.6 (available online) features a discussion on the Illiac IV (a representative of the early
SIMD architectures) and the Cray-1 (a representative of vector architectures). We also look at
multimedia SIMD extensions and the history of GPUs.
Case Study and Exercises by Jason D. Bakos
Case Study: Implementing a Vector Kernel on a Vector Processor and GPU
Concepts illustrated by this case study
■ Programming Vector Processors
■ Programming GPUs
■ Performance Estimation
MrBayes is a popular and well-known computational biology application for inferring the evolutionary histories among a set of input species based on their multiply aligned DNA sequence data of length n. MrBayes works by performing a heuristic search over the space of all binary tree topologies for which the inputs are the leaves. In order to evaluate a particular tree, the application must compute an n × 4 conditional likelihood table (named clP) for each interior node. The table is a function of the conditional likelihood tables of the node's two descendant nodes (clL and clR, single-precision floating point) and their associated n × 4 × 4 transition probability tables (tiPL and tiPR, single-precision floating point). One of this application's kernels is the computation of this conditional likelihood table and is shown below:
for (k=0; k<seq_length; k++) {
    clP[h++] = (tiPL[AA]*clL[A] + tiPL[AC]*clL[C] + tiPL[AG]*clL[G] + tiPL[AT]*clL[T])
             * (tiPR[AA]*clR[A] + tiPR[AC]*clR[C] + tiPR[AG]*clR[G] + tiPR[AT]*clR[T]);
    /* ...analogous statements compute the C, G, and T entries of clP... */
}
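To make the indexing pattern explicit, here is a minimal self-contained restatement of the kernel; it assumes (as the fragment above suggests) that the A..T symbols are offsets into each site's 4-entry likelihood row and the AA..TT symbols are offsets into the flattened 4 × 4 transition matrices, with all tables stored row-major:

/* Hypothetical restatement with explicit loops: i indexes the parent
   nucleotide (A,C,G,T) and j the child nucleotide, so tiPL[16*k + 4*i + j]
   plays the role of the AA..TT constants and clL[4*k + j] the role of
   the A..T constants in the fragment above. */
for (k = 0; k < seq_length; k++) {
    for (i = 0; i < 4; i++) {
        float left = 0.0f, right = 0.0f;
        for (j = 0; j < 4; j++) {
            left  += tiPL[16*k + 4*i + j] * clL[4*k + j];
            right += tiPR[16*k + 4*i + j] * clR[4*k + j];
        }
        clP[4*k + i] = left * right;   /* one entry of the n x 4 table */
    }
}

Each site thus requires two 4 × 4 matrix-vector products followed by four elementwise multiplies, a structure that the vector-processor and GPU implementations in the exercises can exploit.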