the instantaneous requirements of the running application. A true-parallel implementation would assign fixed amounts of circuitry to these pipeline tasks, and thus would be efficient only for applications that tuned their vertex, primitive, and fragment workloads to the static, circuit-specified allocations. Such a static allocation is forced for the fixed-function stages, but these circuits occupy a small fraction of the overall GPU, so they can be overprovisioned without significantly increasing the cost of the GPU.
Task parallelism, in the form of pipeline stages, is only the top level of a hierarchy of parallelism in the GeForce 9800 GTX implementation, which continues down to small groups of transistors. The next lower layers of parallelism are the 16 task-parallel computing cores, each of which is implemented as a data-parallel, 16-wide vector processor. Unlike the CRAY-1, whose vector implementation was virtual parallel, the GeForce 9800 GTX vector implementation is a hybrid of virtual and true parallel: there is separate circuitry for eight vector data paths, each of which is used twice to operate on a 16-wide vector. (This parallel circuitry is represented by the eight data paths [DPs] in each of the 16 cores of Figure 38.4.) A true-parallel vector implementation is sometimes referred to as SIMD (Single Instruction Multiple Data) because a single decoded instruction is applied to each vector element. SIMD cores are desirable because they yield much more computation (GFLOPS) per unit area of silicon than scalar (SISD) cores do, and compute rate is a high priority for GPU implementations.
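The hybrid arrangement can be made concrete with a short sketch. The Python below is an illustrative model only, not the hardware's actual control logic: a 16-wide vector instruction is executed on eight modeled data paths by issuing each half of the vector in a separate pass, the way the 9800 GTX reuses its eight DPs twice per vector operation.

```python
NUM_DATA_PATHS = 8   # true-parallel circuitry: eight physical data paths
VECTOR_WIDTH = 16    # architectural (programmer-visible) vector width

def vector_add(a, b):
    """Add two 16-wide vectors using 8 data paths over two passes.

    In hardware, all eight lanes of a pass operate simultaneously;
    the inner loop here merely models one pass's worth of lanes.
    """
    assert len(a) == len(b) == VECTOR_WIDTH
    result = [0.0] * VECTOR_WIDTH
    for cycle in range(VECTOR_WIDTH // NUM_DATA_PATHS):  # two passes
        base = cycle * NUM_DATA_PATHS
        for lane in range(NUM_DATA_PATHS):
            i = base + lane
            result[i] = a[i] + b[i]
    return result
```

From the programmer's point of view the operation is a single 16-wide vector add; the two-pass reuse of the data paths is invisible, which is what makes the implementation "virtual" in part.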
It is important to understand that the SIMD vector implementation of the GPU cores is not revealed directly in the GPU architecture (as it is in the CRAY-1 vector instructions). While the GPU programming model executes the same program for each element (e.g., each vertex), that vertex program may contain branches, and the branches may be taken differently for each vertex. The implementation includes extra circuitry that supports predication. When all 16 vector data elements take the same branch path, operation continues at full efficiency. A single branch taken differently splits the vector data elements into two groups: those that take one branch path and those that take the other. Both paths are executed, one after the other, with execution suppressed for elements that don't belong to the path being executed. (Figure 38.9 in Section 38.7.3 illustrates diverging and nondiverging predicated execution.) Nested branches may further split the groups. In the limit, separate execution could be required for each element, but this limit is rarely reached. Thus, a Single Program Multiple Data (SPMD) architecture is implemented with predicated SIMD computation cores.
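The serialized two-path execution just described can be sketched in Python. This is a simplified illustration, not the hardware mechanism (real cores use per-lane write-enable masks, not interpreted loops): each element evaluates the branch condition, and the "then" and "else" paths then run one after the other, each skipped entirely when no element needs it, so a nondivergent group pays for only one path.

```python
WIDTH = 16  # vector width of one SIMD group

def predicated_branch(x):
    """Per element: double nonnegative values, negate negative ones.

    Models predicated SIMD execution: both branch paths may run,
    serially, with writes suppressed by a per-element mask.
    """
    assert len(x) == WIDTH
    mask = [v >= 0 for v in x]   # branch condition, evaluated per element
    y = [0] * WIDTH

    # "Then" path: skipped entirely if no element takes it.
    if any(mask):
        for i in range(WIDTH):
            if mask[i]:          # write-enable for this element
                y[i] = x[i] * 2

    # "Else" path: skipped entirely if every element took the branch,
    # in which case the group ran at full efficiency (no divergence).
    if not all(mask):
        for i in range(WIDTH):
            if not mask[i]:
                y[i] = -x[i]
    return y
```

When the mask is all true or all false, only one path executes and no cycles are wasted; when the group diverges, the cost is the sum of both paths, which is the efficiency loss the text describes.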
38.5 Programmability
We begin our discussion of programmability by examining a simple coding example. Listing 38.1 is a complete fragment processing program written in the Direct3D High-Level Shading Language (HLSL). This program specifies the operations that are applied to each pixel fragment⁵ resulting from the rasterization

5. Use of the term "fragment" to distinguish the data structures generated by rasterization (and operated upon by the fragment processing pipeline stage) from the data structures stored in the framebuffer (pixels) was established by OpenGL in 1992 and has since become an accepted industry standard. Microsoft Direct3D and HLSL blur this distinction by referring to both data structures as pixels. The confusion is compounded by the HLSL use of "fragment" to refer to a separable piece of HLSL code. The reader is warned.