Because these per-element operations had no dependencies (i.e., the sum of
any pair of vector elements is independent of the values and sums at all other
pairs of vector elements), the 64 operations specified by a vector instruction
could all be computed at the same time. While such an implementation was
possible in principle, it was impractical given the circuit densities that could
be achieved using the emitter-coupled logic (ECL) that was employed. Indeed, had
such an implementation been possible, its peak performance would have approached
64 × 80,000,000 = 5,120 MFLOPS, more than 20 times the peak performance that was
actually achieved. Instead, the architectural parallelism of equivalent
operations on 64 data pairs was implemented as virtual parallelism: the 64
operations were computed sequentially using a single arithmetic circuit.
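The distinction between the architectural model and the implementation can be made concrete in code. The following C sketch (an illustration, not CRAY-1 machine code) expresses the semantics of a 64-element vector add; the sequential loop is exactly the virtual parallelism just described, with a single arithmetic circuit stepping through the elements one by one.

#include <stddef.h>

#define VLEN 64  /* a CRAY-1 vector register holds 64 elements */

/* Architectural semantics of a vector add: 64 independent
 * element-wise sums. No element depends on any other, so in
 * principle all 64 could be computed simultaneously. */
void vector_add(const double a[VLEN], const double b[VLEN],
                double c[VLEN])
{
    /* The CRAY-1 realized this as virtual parallelism: one
     * arithmetic circuit stepped through the elements
     * sequentially, exactly as this loop does. */
    for (size_t i = 0; i < VLEN; i++)
        c[i] = a[i] + b[i];
}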
The roughly 3× improvement in peak arithmetic performance over scalar per-
formance was instead achieved using a parallelism technique called pipelining.
Two specific circuit approaches were employed. First, because floating-point oper-
ations are too complex to be computed by a single ECL circuit in 12.5 ns, the
floating-point circuits were divided into stages: 6 for addition, 7 for multiplication,
and 14 for reciprocation. Each stage performed a portion of the operation in a sin-
gle cycle, then forwarded the partial result to the next stage. Thus, floating-point
operations were organized into pipelines of sequential stages (e.g., align operands,
check for overflow), with the stages operating in parallel and the final stage of
each unit producing a result each cycle. The reduced complexity of the individ-
ual stages allowed the 12.5 ns cycle time to be achieved, and therefore enabled
sustained scalar performance of 80 MFLOPS. Second, a specialized pipelining
mechanism called chaining allowed the results of one vector instruction to be
used as input to a second vector instruction immediately, as they were computed,
rather than waiting for the 64 operations of the first vector instruction to be com-
pleted. As illustrated in Figure 38.6, this allowed small compound operations on
vectors, such as a × (b + c), to be computed in little more time than was
required for a single vector operation. In the best case, with all three
floating-point processors active, the combination of stage pipelining and
operation chaining allows a performance of 250 MFLOPS to be sustained.

Figure 38.6: Chained evaluation of the vector expression d = a × (b + c) on the
CRAY-1 supercomputer. The floating-point addition unit takes six pipelined steps
to compute its result; the multiplication unit takes seven. The operand streams
b[n] and c[n] enter the adder; the sums (b + c)[n − 6] emerge six cycles later
and are chained directly into the multiplier along with a[n − 6], so the products
(a × (b + c))[n − 13] appear after seven more cycles.
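A back-of-the-envelope cycle model makes the benefit of chaining concrete. The sketch below assumes, per the text, a 6-stage adder and a 7-stage multiplier, each accepting one operand pair per cycle; the counting scheme is an illustration of the idea, not a description of the actual CRAY-1 control logic.

#include <stdio.h>

#define VLEN       64  /* elements per vector register         */
#define ADD_STAGES  6  /* pipeline depth of the FP adder       */
#define MUL_STAGES  7  /* pipeline depth of the FP multiplier  */

int main(void)
{
    /* Without chaining, the multiply cannot issue until the entire
     * add completes: element 63's sum appears ADD_STAGES cycles
     * after it issues, then 64 multiplies follow. */
    int unchained = (VLEN - 1 + ADD_STAGES + 1)   /* adds done  */
                  + (VLEN - 1 + MUL_STAGES + 1);  /* muls done  */

    /* With chaining, each sum (b + c)[n - 6] streams into the
     * multiplier as it emerges, so (a * (b + c))[n - 13] appears
     * at cycle n + ADD_STAGES + MUL_STAGES. */
    int chained = (VLEN - 1) + ADD_STAGES + MUL_STAGES + 1;

    printf("unchained: %d cycles\n", unchained);  /* 141 */
    printf("chained:   %d cycles\n", chained);    /*  77 */
    return 0;
}

Under these assumptions the chained form finishes all 128 floating-point operations in 77 cycles rather than 141, which matches the claim above that the compound expression costs little more than a single vector operation (70 cycles for the add alone).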
Another important characterization distinguishes task parallelism from data
parallelism. Data parallelism is the special case of performing the same operation
on equivalently structured, but distinct, data elements. The CRAY-1 vector instruc-
tions specify data-parallel operation: The same operation is performed on up to 64
floating-point operand pairs. Task parallelism is the general case of performing
two or more distinct operations on individual data sets. Pipeline parallelism, such
as the CRAY-1 floating-point circuit stages and operation chaining, is a specific
organization of task parallelism. Other examples of task parallelism include mul-
tiple threads in a concurrent program, multiple processes running on an operating
system, and, indeed, multiple operating systems running on a virtual machine.
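The distinction can be illustrated with a small C program; the two thread tasks (sum_task and max_task) are hypothetical stand-ins for any pair of distinct operations. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>

#define N 64

/* Data parallelism: one operation applied uniformly to distinct,
 * equivalently structured elements, which is what a CRAY-1 vector
 * instruction expresses. */
static void scale(double v[N], double s)
{
    for (int i = 0; i < N; i++)
        v[i] *= s;
}

/* Task parallelism: two distinct operations, run concurrently. */
static void *sum_task(void *arg)
{
    double *v = arg, s = 0.0;
    for (int i = 0; i < N; i++) s += v[i];
    printf("sum = %g\n", s);
    return NULL;
}

static void *max_task(void *arg)
{
    double *v = arg, m = v[0];
    for (int i = 1; i < N; i++) if (v[i] > m) m = v[i];
    printf("max = %g\n", m);
    return NULL;
}

int main(void)
{
    double v[N];
    for (int i = 0; i < N; i++) v[i] = i;

    scale(v, 2.0);  /* data-parallel in form, sequential here */

    pthread_t t1, t2;  /* two distinct tasks on the same data */
    pthread_create(&t1, NULL, sum_task, v);
    pthread_create(&t2, NULL, max_task, v);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}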
GPU parallelism can also be characterized using these distinctions. GPU
architecture (see Figure 38.3) is a task-parallel pipeline. The GeForce 9800 GTX
implementation of this pipeline is a combination of true pipeline parallelism and
virtual pipeline parallelism. Fixed-function vertex, primitive, and fragment gen-
eration stages are implemented with separate circuits—they are examples of true
parallelism. Programmable vertex, primitive, and fragment processing stages are
implemented with a single computation engine that is shared among these distinct
tasks. Virtualization allows this expensive computation engine, which occupies a
significant fraction of the GPU's circuitry, to be allocated dynamically, based on
the demands of the workload.
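A heavily simplified sketch of that virtualization follows; the run_on_engine function stands in for the shared computation engine, and the queue contents and scheduling order are invented for illustration.

#include <stdio.h>
#include <stddef.h>

/* Task types corresponding to the three programmable stages. */
enum stage { VERTEX, PRIMITIVE, FRAGMENT };

struct work { enum stage s; int id; };

/* One shared engine services every task type; which task runs is
 * decided per work item rather than fixed per circuit. */
static void run_on_engine(struct work w)
{
    static const char *name[] = { "vertex", "primitive", "fragment" };
    printf("engine -> %s job %d\n", name[w.s], w.id);
}

int main(void)
{
    /* A mixed queue: allocation of the single engine follows
     * whatever the workload currently demands. */
    struct work queue[] = { {VERTEX, 0}, {VERTEX, 1}, {PRIMITIVE, 0},
                            {FRAGMENT, 0}, {FRAGMENT, 1}, {FRAGMENT, 2} };
    for (size_t i = 0; i < sizeof queue / sizeof queue[0]; i++)
        run_on_engine(queue[i]);
    return 0;
}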