Any function can be used as a reduction operator, and common cases include operators such as max and min.
Reductions are sometimes handled by special hardware in a vector and SIMD architecture
that allows the reduce step to be done much faster than it could be done in scalar mode. These
work by implementing a technique similar to what can be done in a multiprocessor environment. While the general transformation works with any number of processors, suppose for
simplicity we have 10 processors. In the first step of reducing the sum, each processor executes
the following (with p as the processor number ranging from 0 to 9):
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
This loop, which sums up 1000 elements on each of the ten processors, is completely parallel. A simple scalar loop can then complete the summation of the last ten sums. Similar approaches are used in vector and SIMD processors.
It is important to observe that the above transformation relies on associativity of addition.
Although arithmetic with unlimited range and precision is associative, computer arithmetic is
not associative, for either integer arithmetic, because of limited range, or floating-point arithmetic, because of both range and precision. Thus, using these restructuring techniques can
sometimes lead to erroneous behavior, although such occurrences are rare. For this reason,
most compilers require that optimizations that rely on associativity be explicitly enabled.
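As a small illustration of the problem (the constants below are chosen purely for this example), the following C fragment adds the same three double-precision values in two different orders and prints two different results:

double a = 1.0, b = 1.0e20, c = -1.0e20;
printf("%g\n", (a + b) + c);     /* prints 0: a is lost when rounded into b */
printf("%g\n", a + (b + c));     /* prints 1: b and c cancel exactly first */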
4.6 Crosscutting Issues
Energy and DLP: Slow and Wide Versus Fast and Narrow
A fundamental energy advantage of data-level parallel architectures comes from the energy
equation in Chapter 1. Since we assume ample data-level parallelism, the performance is the
same if we halve the clock rate and double the execution resources: twice the number of lanes
for a vector computer, wider registers and ALUs for multimedia SIMD, and more SIMD lanes
for GPUs. If we can lower the voltage while dropping the clock rate, we can actually reduce
energy as well as the power for the computation while maintaining the same peak performance. Hence, DLP processors tend to have lower clock rates than system processors, which rely on high clock rates for performance (see Section 4.7).
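As a rough numerical sketch (the 20% voltage reduction and the other figures below are assumed for illustration, not taken from a real design), the Chapter 1 relation that dynamic power scales with capacitance × voltage² × frequency gives, relative to a fast-and-narrow baseline:

double v = 0.8, f = 0.5, lanes = 2.0;     /* assumed: 20% lower voltage, half the clock rate, twice the lanes */
double rel_perf  = lanes * f;             /* 1.0: peak performance is unchanged */
double rel_power = lanes * v*v * f;       /* 0.64: roughly a third less dynamic power, and less energy per operation */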
Compared to out-of-order processors, DLP processors can have simpler control logic to
launch a large number of operations per clock cycle; for example, the control is identical for all
lanes in vector processors, and there is no logic to decide on multiple instruction issue or speculative execution. Vector architectures can also make it easier to turn off unused portions
of the chip. Each vector instruction explicitly describes all the resources it needs for a number
of cycles when the instruction issues.
Banked Memory and Graphics Memory
Section 4.2 noted the importance of substantial memory bandwidth for vector architectures to
support unit stride, non-unit stride, and gather-scatter accesses.
To achieve their high performance, GPUs also require substantial memory bandwidth. Special DRAM chips designed just for GPUs, called GDRAM for graphics DRAM, help deliver this bandwidth. GDRAM chips have higher bandwidth, often at lower capacity, than conventional DRAM chips. To deliver this bandwidth, GDRAM chips are often soldered directly onto the same board as the GPU rather than being placed into DIMM modules that are inserted into slots on a board.
 