3. Constrained access: A coherent view of memory is enforced by constraining how memory is accessed, rather than by adding complexity to the memory hierarchy implementation. This is the approach taken by GPUs as they are exposed by the Direct3D and OpenGL interfaces. Memory that is shared, such as texture images, is read-only, so no inconsistencies are possible. Memory that is written, such as framebuffer memory, is write-only from the viewpoint of the parallel cores—read-modify-write framebuffer operations, such as depth buffering and blending, are implemented by dedicated "pixel ops" circuits that operate directly on framebuffer physical memory (see Figure 38.4). So again, no inconsistency is possible (see the sketch below).
Cache memory is a central concern in computer architecture, and we have only scratched the surface in this short section. Rich topics, such as eviction policy and set associativity, have not been covered at all. Interested readers are encouraged to peruse the list of suggested readings at the end of this chapter. Be warned, however, that most of the literature is written from the perspective of CPU architects, whose experience and consequent world views differ from those of GPU architects.
38.7.3 Divergence
We learned in Section 38.4 that GPUs such as the GeForce 9800 GTX implement a SPMD programming model with SIMD processing cores. The single-program, multiple-data programming model allows shaders to be written as though each were executed individually, greatly simplifying the programmer's job. The single-instruction, multiple-data implementation collects elements (e.g., vertices, primitives, or pixel fragments) into short vectors, which are executed in parallel by data paths that share a single instruction sequence unit. (The GeForce 9800 GTX has 16 cores, each with an effective vector length of 16 elements.)
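The following C++ sketch models this arrangement; it is an illustration rather than anything taken from the chapter, and the names (kWidth, Fragment, shadeOne, shadeVector) are hypothetical. The shader is written as a scalar, per-element function, while the implementation applies it to 16 elements in lockstep, as if one instruction sequence unit drove 16 data paths.

```cpp
// Sketch: SPMD shader, SIMD execution.  The vector width and element
// type are illustrative; real hardware fixes these.
#include <array>
#include <cstddef>

constexpr std::size_t kWidth = 16;      // effective SIMD vector length

struct Fragment { float r, g, b; };

// The programmer writes the shader as if it processed one element at a
// time (single program, multiple data).
Fragment shadeOne(const Fragment& in) {
    return { in.r * 0.5f, in.g * 0.5f, in.b * 0.5f };
}

// The implementation gathers 16 elements and runs them in lockstep:
// conceptually, one instruction sequencer drives 16 data paths.
void shadeVector(std::array<Fragment, kWidth>& frags) {
    for (std::size_t lane = 0; lane < kWidth; ++lane) {
        frags[lane] = shadeOne(frags[lane]);  // same instructions, different data
    }
}
```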
The motivation for SIMD implementation is efficiency: Sharing a single instruction stream among multiple data paths allows more data paths to be implemented in the same silicon area and reduces instruction-fetch bandwidth per data path. For example, if a core's instruction sequence unit occupies the same silicon area as one data path, then a 16-wide true-parallel SIMD core occupies just over half (17/32) the area required by 16 SISD cores: one sequence unit plus 16 data paths, versus 16 of each. This almost doubles peak performance per unit silicon area. But this efficiency is achieved only when the elements assigned to a vector require the same sequence of instructions. When different sequences are required (that is, when the instruction sequences diverge), efficiency is lost.
When GPU shader programming was first exposed in OpenGL and Direct3D, shaders had no conditional branch instructions. Each instance of such a shader executed the same sequence of instructions, regardless of the element data being operated on, so divergence was limited to vectors that straddled a change made to a shader. Then, as now, GPU architectures encouraged operations on large blocks of data (e.g., Direct3D vertex buffers and OpenGL vertex arrays), during which no changes can be made to shaders. And GPU implementations typically packed SIMD vectors first-come, first-served, just as skiers are loaded onto lift chairs. So vectors that straddled changes in shaders were unusual, and divergence was not a significant problem.
Modern GPUs do support conditional branching in shaders, however, and its use by programmers does increase divergence. In the worst case, when each element of a vector requires a different instruction sequence, execution is effectively serialized, and efficiency can fall by a factor approaching the vector length.
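To see where the efficiency goes, the sketch below extends the lockstep model above with a data-dependent branch. Because all 16 lanes share one instruction sequencer, an if/else is handled here by issuing both paths and masking out the inactive lanes, a common scheme for SIMD hardware; the details and names (shadeWithBranch, issuedPaths) are illustrative, not a description of any particular GPU. When every lane takes the same path, only one side is issued; when the lanes diverge, the vector pays for both.

```cpp
// Sketch: cost of divergence under masked SIMD execution.
#include <array>
#include <bitset>
#include <cstddef>

constexpr std::size_t kWidth = 16;

// One 16-wide step: the shared sequencer evaluates the branch condition
// for all lanes, then issues each side only if some lane needs it.
void shadeWithBranch(std::array<float, kWidth>& x, std::size_t& issuedPaths) {
    std::bitset<kWidth> takeThen;
    for (std::size_t lane = 0; lane < kWidth; ++lane)
        takeThen[lane] = (x[lane] > 0.0f);          // per-lane condition

    if (takeThen.any()) {                           // "then" path issued once
        ++issuedPaths;
        for (std::size_t lane = 0; lane < kWidth; ++lane)
            if (takeThen[lane]) x[lane] *= 2.0f;    // active lanes only
    }
    if (!takeThen.all()) {                          // "else" path issued once
        ++issuedPaths;
        for (std::size_t lane = 0; lane < kWidth; ++lane)
            if (!takeThen[lane]) x[lane] = 0.0f;
    }
    // Uniform vector: issuedPaths == 1, full efficiency.
    // Divergent vector: issuedPaths == 2, so each data path is idle
    // during the path it does not take, and throughput drops.
}
```

With nested branches the cost compounds, since a vector pays for every path that any of its lanes takes; in the limit, each of the 16 lanes follows its own path and each data path does useful work only about 1/16 of the time.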