Because these per-element operations had no dependencies (i.e., the sum of
any pair of vector elements is independent of the values and sums at all other
pairs of vector elements), the 64 operations specified by a vector instruction
could all be computed at the same time. While such an implementation was
possible in principle, it was impractical given the circuit densities that could
be achieved using the emitter-coupled logic (ECL) that was employed. Indeed, had
such an implementation been possible, its peak performance would have approached
64 × 80,000,000 = 5,120 MFLOPS, more than 20 times the peak performance that was
actually achieved. Instead, the architectural parallelism of equivalent
operations on 64 data pairs was implemented as virtual parallelism: the 64
operations were computed sequentially using a single arithmetic circuit.
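The distinction between the architectural model and the implementation can be made concrete in code. The following C sketch (an illustration, not CRAY-1 machine code) expresses the semantics of a 64-element vector add; the sequential loop is exactly the virtual parallelism just described, with a single arithmetic circuit stepping through the elements one by one.

#include <stddef.h>

#define VLEN 64  /* a CRAY-1 vector register holds 64 elements */

/* Architectural semantics of a vector add: 64 independent
 * element-wise sums. No element depends on any other, so in
 * principle all 64 could be computed simultaneously. */
void vector_add(const double a[VLEN], const double b[VLEN],
                double c[VLEN])
{
    /* The CRAY-1 realized this as virtual parallelism: one
     * arithmetic circuit stepped through the elements
     * sequentially, exactly as this loop does. */
    for (size_t i = 0; i < VLEN; i++)
        c[i] = a[i] + b[i];
}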
The roughly 3× improvement in peak arithmetic performance over scalar per-
formance was instead achieved using a parallelism technique called pipelining.
Two specific circuit approaches were employed. First, because floating-point oper-
ations are too complex to be computed by a single ECL circuit in 12.5 ns, the
floating-point circuits were divided into stages: 6 for addition, 7 for multiplication,
and 14 for reciprocation. Each stage performed a portion of the operation in a sin-
gle cycle, then forwarded the partial result to the next stage. Thus, floating-point
operations were organized into pipelines of sequential stages (e.g., align operands,
check for overflow), with the stages operating in parallel and the final stage of
each unit producing a result each cycle. The reduced complexity of the individ-
ual stages allowed the 12.5 ns cycle time to be achieved, and therefore enabled
sustained scalar performance of 80 MFLOPS. Second, a specialized pipelining
mechanism called chaining allowed the results of one vector instruction to be
used as input to a second vector instruction immediately, as they were computed,
rather than waiting for the 64 operations of the first vector instruction to be com-
pleted. As illustrated in Figure 38.6, this allowed small compound operations on
vectors, such as a × (b + c), to be computed in little more time than was
required for a single vector operation. In the best case, with all three
floating-point processors active, the combination of stage pipelining and
operation chaining allows a performance of 250 MFLOPS to be sustained.

Figure 38.6: Chained evaluation of the vector expression d = a × (b + c) on the
CRAY-1 supercomputer. The floating-point addition unit takes six pipelined steps
to compute its result; the multiplication unit takes seven. The operand streams
b[n] and c[n] enter the adder; the sums (b + c)[n − 6] emerge six cycles later
and are chained directly into the multiplier along with a[n − 6], so the products
(a × (b + c))[n − 13] appear after seven more cycles.
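A back-of-the-envelope cycle model makes the benefit of chaining concrete. The sketch below assumes, per the text, a 6-stage adder and a 7-stage multiplier, each accepting one operand pair per cycle; the counting scheme is an illustration of the idea, not a description of the actual CRAY-1 control logic.

#include <stdio.h>

#define VLEN       64  /* elements per vector register         */
#define ADD_STAGES  6  /* pipeline depth of the FP adder       */
#define MUL_STAGES  7  /* pipeline depth of the FP multiplier  */

int main(void)
{
    /* Without chaining, the multiply cannot issue until the entire
     * add completes: element 63's sum appears ADD_STAGES cycles
     * after it issues, then 64 multiplies follow. */
    int unchained = (VLEN - 1 + ADD_STAGES + 1)   /* adds done  */
                  + (VLEN - 1 + MUL_STAGES + 1);  /* muls done  */

    /* With chaining, each sum (b + c)[n - 6] streams into the
     * multiplier as it emerges, so (a * (b + c))[n - 13] appears
     * at cycle n + ADD_STAGES + MUL_STAGES. */
    int chained = (VLEN - 1) + ADD_STAGES + MUL_STAGES + 1;

    printf("unchained: %d cycles\n", unchained);  /* 141 */
    printf("chained:   %d cycles\n", chained);    /*  77 */
    return 0;
}

Under these assumptions the chained form finishes all 128 floating-point operations in 77 cycles rather than 141, which matches the claim above that the compound expression costs little more than a single vector operation (70 cycles for the add alone).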
Another important characterization distinguishes task parallelism from data
parallelism. Data parallelism is the special case of performing the same operation
on equivalently structured, but distinct, data elements. The CRAY-1 vector instruc-
tions specify data-parallel operation: The same operation is performed on up to 64
floating-point operand pairs. Task parallelism is the general case of performing
two or more distinct operations on individual data sets. Pipeline parallelism, such
as the CRAY-1 floating-point circuit stages and operation chaining, is a specific
organization of task parallelism. Other examples of task parallelism include mul-
tiple threads in a concurrent program, multiple processes running on an operating
system, and, indeed, multiple operating systems running on a virtual machine.
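The distinction can be illustrated with a small C program; the two thread tasks (sum_task and max_task) are hypothetical stand-ins for any pair of distinct operations. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>

#define N 64

/* Data parallelism: one operation applied uniformly to distinct,
 * equivalently structured elements, which is what a CRAY-1 vector
 * instruction expresses. */
static void scale(double v[N], double s)
{
    for (int i = 0; i < N; i++)
        v[i] *= s;
}

/* Task parallelism: two distinct operations, run concurrently. */
static void *sum_task(void *arg)
{
    double *v = arg, s = 0.0;
    for (int i = 0; i < N; i++) s += v[i];
    printf("sum = %g\n", s);
    return NULL;
}

static void *max_task(void *arg)
{
    double *v = arg, m = v[0];
    for (int i = 1; i < N; i++) if (v[i] > m) m = v[i];
    printf("max = %g\n", m);
    return NULL;
}

int main(void)
{
    double v[N];
    for (int i = 0; i < N; i++) v[i] = i;

    scale(v, 2.0);  /* data-parallel in form, sequential here */

    pthread_t t1, t2;  /* two distinct tasks on the same data */
    pthread_create(&t1, NULL, sum_task, v);
    pthread_create(&t2, NULL, max_task, v);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}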
GPU parallelism can also be characterized using these distinctions. GPU
architecture (see Figure 38.3) is a task-parallel pipeline. The GeForce 9800 GTX
implementation of this pipeline is a combination of true pipeline parallelism and
virtual pipeline parallelism. Fixed-function vertex, primitive, and fragment gen-
eration stages are implemented with separate circuits—they are examples of true
parallelism. Programmable vertex, primitive, and fragment processing stages are
implemented with a single computation engine that is shared among these distinct
tasks. Virtualization allows this expensive computation engine, which occupies a
significant fraction of the GPU's circuitry, to be allocated dynamically, based on
the demands of the workload.
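A heavily simplified sketch of that virtualization follows; the run_on_engine function stands in for the shared computation engine, and the queue contents and scheduling order are invented for illustration.

#include <stdio.h>
#include <stddef.h>

/* Task types corresponding to the three programmable stages. */
enum stage { VERTEX, PRIMITIVE, FRAGMENT };

struct work { enum stage s; int id; };

/* One shared engine services every task type; which task runs is
 * decided per work item rather than fixed per circuit. */
static void run_on_engine(struct work w)
{
    static const char *name[] = { "vertex", "primitive", "fragment" };
    printf("engine -> %s job %d\n", name[w.s], w.id);
}

int main(void)
{
    /* A mixed queue: allocation of the single engine follows
     * whatever the workload currently demands. */
    struct work queue[] = { {VERTEX, 0}, {VERTEX, 1}, {PRIMITIVE, 0},
                            {FRAGMENT, 0}, {FRAGMENT, 1}, {FRAGMENT, 2} };
    for (size_t i = 0; i < sizeof queue / sizeof queue[0]; i++)
        run_on_engine(queue[i]);
    return 0;
}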