3. Constrained access: A coherent view of memory is enforced by constraining how memory is accessed, rather than by adding complexity to the memory hierarchy implementation. This is the approach taken by GPUs as they are exposed by the Direct3D and OpenGL interfaces. Memory that is shared, such as texture images, is read-only, so no inconsistencies are possible. Memory that is written, such as framebuffer memory, is write-only from the viewpoint of the parallel cores—read-modify-write framebuffer operations, such as depth buffering and blending, are implemented by dedicated "pixel ops" circuits that operate directly on framebuffer physical memory (see Figure 38.4). So again, no inconsistency is possible (see the sketch below).
Cache memory is a central concern in computer architecture, and we have only scratched the surface in this short section. Rich topics, such as eviction policy and set associativity, have not been covered at all. Interested readers are encouraged to peruse the list of suggested readings at the end of this chapter. Be warned, however, that most of the literature is written from the perspective of CPU architects, whose experience and consequent world views differ from those of GPU architects.
38.7.3 Divergence
We learned in Section 38.4 that GPUs such as the GeForce 9800 GTX implement a SPMD programming model with SIMD processing cores. The single-program, multiple-data programming model allows shaders to be written as though each were executed individually, greatly simplifying the programmer's job. The single-instruction, multiple-data implementation collects elements (e.g., vertices, primitives, or pixel fragments) into short vectors, which are executed in parallel by data paths that share a single instruction sequence unit. (The GeForce 9800 GTX has 16 cores, each with an effective vector length of 16 elements.)
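The following C++ sketch models this arrangement; it is an illustration rather than anything taken from the chapter, and the names (kWidth, Fragment, shadeOne, shadeVector) are hypothetical. The shader is written as a scalar, per-element function, while the implementation applies it to 16 elements in lockstep, as if one instruction sequence unit drove 16 data paths.

```cpp
// Sketch: SPMD shader, SIMD execution.  The vector width and element
// type are illustrative; real hardware fixes these.
#include <array>
#include <cstddef>

constexpr std::size_t kWidth = 16;      // effective SIMD vector length

struct Fragment { float r, g, b; };

// The programmer writes the shader as if it processed one element at a
// time (single program, multiple data).
Fragment shadeOne(const Fragment& in) {
    return { in.r * 0.5f, in.g * 0.5f, in.b * 0.5f };
}

// The implementation gathers 16 elements and runs them in lockstep:
// conceptually, one instruction sequencer drives 16 data paths.
void shadeVector(std::array<Fragment, kWidth>& frags) {
    for (std::size_t lane = 0; lane < kWidth; ++lane) {
        frags[lane] = shadeOne(frags[lane]);  // same instructions, different data
    }
}
```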
The motivation for SIMD implementation is efficiency: Sharing a single instruction stream among multiple data paths allows more data paths to be implemented in the same silicon area and reduces instruction-fetch bandwidth per data path. For example, if a core's instruction sequence unit occupies the same silicon area as one data path, then a 16-wide true-parallel SIMD core occupies just over half (17/32) the area required by 16 SISD cores: one sequence unit plus 16 data paths, versus 16 of each. This almost doubles peak performance per unit silicon area. But this efficiency is achieved only when the elements assigned to a vector require the same sequence of instructions. When different sequences are required (that is, when the instruction sequences diverge), efficiency is lost.
When GPU shader programming was first exposed in OpenGL and Direct3D, shaders had no conditional branch instructions. Each instance of such a shader executed the same sequence of instructions, regardless of the element data being operated on, so divergence was limited to vectors that straddled a change made to a shader. Then, as now, GPU architectures encouraged operations on large blocks of data (e.g., Direct3D vertex buffers and OpenGL vertex arrays), during which no changes can be made to shaders. And GPU implementations typically packed SIMD vectors first-come, first-served, just as skiers are loaded onto lift chairs. So vectors that straddled changes in shaders were unusual, and divergence was not a significant problem.
Modern GPUs do support conditional branching in shaders, however, and its use by programmers does increase divergence. In the worst case, when each element of a vector requires a different instruction sequence, execution is effectively serialized, and efficiency can fall by a factor approaching the vector length.
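To see where the efficiency goes, the sketch below extends the lockstep model above with a data-dependent branch. Because all 16 lanes share one instruction sequencer, an if/else is handled here by issuing both paths and masking out the inactive lanes, a common scheme for SIMD hardware; the details and names (shadeWithBranch, issuedPaths) are illustrative, not a description of any particular GPU. When every lane takes the same path, only one side is issued; when the lanes diverge, the vector pays for both.

```cpp
// Sketch: cost of divergence under masked SIMD execution.
#include <array>
#include <bitset>
#include <cstddef>

constexpr std::size_t kWidth = 16;

// One 16-wide step: the shared sequencer evaluates the branch condition
// for all lanes, then issues each side only if some lane needs it.
void shadeWithBranch(std::array<float, kWidth>& x, std::size_t& issuedPaths) {
    std::bitset<kWidth> takeThen;
    for (std::size_t lane = 0; lane < kWidth; ++lane)
        takeThen[lane] = (x[lane] > 0.0f);          // per-lane condition

    if (takeThen.any()) {                           // "then" path issued once
        ++issuedPaths;
        for (std::size_t lane = 0; lane < kWidth; ++lane)
            if (takeThen[lane]) x[lane] *= 2.0f;    // active lanes only
    }
    if (!takeThen.all()) {                          // "else" path issued once
        ++issuedPaths;
        for (std::size_t lane = 0; lane < kWidth; ++lane)
            if (!takeThen[lane]) x[lane] = 0.0f;
    }
    // Uniform vector: issuedPaths == 1, full efficiency.
    // Divergent vector: issuedPaths == 2, so each data path is idle
    // during the path it does not take, and throughput drops.
}
```

With nested branches the cost compounds, since a vector pays for every path that any of its lanes takes; in the limit, each of the 16 lanes follows its own path and each data path does useful work only about 1/16 of the time.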