the instantaneous requirements of the running application. A true-parallel implementation would assign fixed amounts of circuitry to these pipeline tasks, and thus would be efficient only for applications that tuned their vertex, primitive, and fragment workloads to the static, circuit-specified allocations. Such a static allocation is forced for the fixed-function stages, but these circuits occupy a small fraction of the overall GPU, so they can be overprovisioned without significantly increasing the cost of the GPU.
Task parallelism, in the form of pipeline stages, is only the top level of a hierarchy of parallelism in the GeForce 9800 GTX implementation, which continues down to small groups of transistors. The next lower layers of parallelism are the 16 task-parallel computing cores, each of which is implemented as a data-parallel, 16-wide vector processor. Unlike the CRAY-1, whose vector implementation was virtual parallel, the GeForce 9800 GTX vector implementation is a hybrid of virtual and true parallel: there is separate circuitry for eight vector data paths, each of which is used twice to operate on a 16-wide vector. (This parallel circuitry is represented by the eight data paths [DPs] in each of the 16 cores of Figure 38.4.) A true-parallel vector implementation is sometimes referred to as SIMD (Single Instruction Multiple Data) because a single decoded instruction is applied to each vector element. SIMD cores are desirable because they yield much more computation (GFLOPS) per unit area of silicon than scalar (SISD) cores do, and compute rate is a high priority for GPU implementations.
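The hybrid arrangement can be made concrete with a short sketch. The Python below is an illustrative model only, not the hardware's actual control logic: a 16-wide vector instruction is executed on eight modeled data paths by issuing each half of the vector in a separate pass, the way the 9800 GTX reuses its eight DPs twice per vector operation.

```python
NUM_DATA_PATHS = 8   # true-parallel circuitry: eight physical data paths
VECTOR_WIDTH = 16    # architectural (programmer-visible) vector width

def vector_add(a, b):
    """Add two 16-wide vectors using 8 data paths over two passes.

    In hardware, all eight lanes of a pass operate simultaneously;
    the inner loop here merely models one pass's worth of lanes.
    """
    assert len(a) == len(b) == VECTOR_WIDTH
    result = [0.0] * VECTOR_WIDTH
    for cycle in range(VECTOR_WIDTH // NUM_DATA_PATHS):  # two passes
        base = cycle * NUM_DATA_PATHS
        for lane in range(NUM_DATA_PATHS):
            i = base + lane
            result[i] = a[i] + b[i]
    return result
```

From the programmer's point of view the operation is a single 16-wide vector add; the two-pass reuse of the data paths is invisible, which is what makes the implementation "virtual" in part.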
It is important to understand that the SIMD vector implementation of the GPU cores is not revealed directly in the GPU architecture (as it is in the CRAY-1 vector instructions). While the GPU programming model executes the same program for each element (e.g., each vertex), that vertex program may contain branches, and the branches may be taken differently for each vertex. The implementation includes extra circuitry that supports predication. When all 16 vector data elements take the same branch path, operation continues at full efficiency. A single branch taken differently splits the vector data elements into two groups: those that take one branch path and those that take the other. Both paths are executed, one after the other, with execution suppressed for elements that don't belong to the path being executed. (Figure 38.9 in Section 38.7.3 illustrates diverging and nondiverging predicated execution.) Nested branches may further split the groups. In the limit, separate execution could be required for each element, but this limit is rarely reached. Thus, a Single Program Multiple Data (SPMD) architecture is implemented with predicated SIMD computation cores.
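The serialized two-path execution just described can be sketched in Python. This is a simplified illustration, not the hardware mechanism (real cores use per-lane write-enable masks, not interpreted loops): each element evaluates the branch condition, and the "then" and "else" paths then run one after the other, each skipped entirely when no element needs it, so a nondivergent group pays for only one path.

```python
WIDTH = 16  # vector width of one SIMD group

def predicated_branch(x):
    """Per element: double nonnegative values, negate negative ones.

    Models predicated SIMD execution: both branch paths may run,
    serially, with writes suppressed by a per-element mask.
    """
    assert len(x) == WIDTH
    mask = [v >= 0 for v in x]   # branch condition, evaluated per element
    y = [0] * WIDTH

    # "Then" path: skipped entirely if no element takes it.
    if any(mask):
        for i in range(WIDTH):
            if mask[i]:          # write-enable for this element
                y[i] = x[i] * 2

    # "Else" path: skipped entirely if every element took the branch,
    # in which case the group ran at full efficiency (no divergence).
    if not all(mask):
        for i in range(WIDTH):
            if not mask[i]:
                y[i] = -x[i]
    return y
```

When the mask is all true or all false, only one path executes and no cycles are wasted; when the group diverges, the cost is the sum of both paths, which is the efficiency loss the text describes.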
38.5 Programmability
We begin our discussion of programmability by examining a simple coding example. Listing 38.1 is a complete fragment processing program written in the Direct3D High-Level Shading Language (HLSL). This program specifies the operations that are applied to each pixel fragment⁵ resulting from the rasterization

5. Use of the term "fragment" to distinguish the data structures generated by rasterization (and operated upon by the fragment processing pipeline stage) from the data structures stored in the framebuffer (pixels) was established by OpenGL in 1992 and has since become an accepted industry standard. Microsoft Direct3D and HLSL blur this distinction by referring to both data structures as pixels. The confusion is compounded by the HLSL use of "fragment" to refer to a separable piece of HLSL code. The reader is warned.