Any function can be used as a reduction operator, and common cases include operators such as max and min.
Reductions are sometimes handled by special hardware in a vector and SIMD architecture
that allows the reduce step to be done much faster than it could be done in scalar mode. These
work by implementing a technique similar to what can be done in a multiprocessor environment. While the general transformation works with any number of processors, suppose for
simplicity we have 10 processors. In the first step of reducing the sum, each processor executes
the following (with p as the processor number ranging from 0 to 9):
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
This loop, which sums up 1000 elements on each of the ten processors, is completely parallel. A simple scalar loop can then complete the summation of the last ten sums. Similar approaches are used in vector and SIMD processors.
It is important to observe that the above transformation relies on associativity of addition.
Although arithmetic with unlimited range and precision is associative, computer arithmetic is
not associative, for either integer arithmetic, because of limited range, or floating-point arithmetic, because of both range and precision. Thus, using these restructuring techniques can
sometimes lead to erroneous behavior, although such occurrences are rare. For this reason,
most compilers require that optimizations that rely on associativity be explicitly enabled.
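As a small illustration of the problem (the constants below are chosen purely for this example), the following C fragment adds the same three double-precision values in two different orders and prints two different results:

double a = 1.0, b = 1.0e20, c = -1.0e20;
printf("%g\n", (a + b) + c);     /* prints 0: a is lost when rounded into b */
printf("%g\n", a + (b + c));     /* prints 1: b and c cancel exactly first */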
4.6 Crosscutting Issues
Energy and DLP: Slow and Wide Versus Fast and Narrow
A fundamental energy advantage of data-level parallel architectures comes from the energy
equation in Chapter 1. Since we assume ample data-level parallelism, the performance is the
same if we halve the clock rate and double the execution resources: twice the number of lanes
for a vector computer, wider registers and ALUs for multimedia SIMD, and more SIMD lanes
for GPUs. If we can lower the voltage while dropping the clock rate, we can actually reduce
energy as well as the power for the computation while maintaining the same peak performance. Hence, DLP processors tend to have lower clock rates than system processors, which rely on high clock rates for performance (see Section 4.7).
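As a rough numerical sketch (the 20% voltage reduction and the other figures below are assumed for illustration, not taken from a real design), the Chapter 1 relation that dynamic power scales with capacitance × voltage² × frequency gives, relative to a fast-and-narrow baseline:

double v = 0.8, f = 0.5, lanes = 2.0;     /* assumed: 20% lower voltage, half the clock rate, twice the lanes */
double rel_perf  = lanes * f;             /* 1.0: peak performance is unchanged */
double rel_power = lanes * v*v * f;       /* 0.64: roughly a third less dynamic power, and less energy per operation */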
Compared to out-of-order processors, DLP processors can have simpler control logic to
launch a large number of operations per clock cycle; for example, the control is identical for all
lanes in vector processors, and there is no logic to decide on multiple instruction issue or speculative execution. Vector architectures can also make it easier to turn off unused portions
of the chip. Each vector instruction explicitly describes all the resources it needs for a number
of cycles when the instruction issues.
Banked Memory and Graphics Memory
Section 4.2 noted the importance of substantial memory bandwidth for vector architectures to
support unit stride, non-unit stride, and gather-scatter accesses.
To achieve their high performance, GPUs also require substantial memory bandwidth. Special DRAM chips designed just for GPUs, called GDRAM for graphics DRAM, help deliver this bandwidth. GDRAM chips have higher bandwidth, often at lower capacity, than conventional DRAM chips. To deliver this bandwidth, GDRAM chips are often soldered directly onto the same board as the GPU rather than being placed into DIMM modules that are inserted into slots on a board.
 