Hardware Reference
In-Depth Information
slots on a board, as is the case for system memory. DIMM modules allow for much greater
capacity and for the system to be upgraded, unlike GDRAM. This limited capacity—about 4
GB in 2011—is in conflict with the goal of running bigger problems, which is a natural use of
the increased computational power of GPUs.
To deliver the best possible performance, GPUs try to take into account all the features of
GDRAMs. They are typically arranged internally as 4 to 8 banks, with a power-of-2 number
of rows (typically 16,384) and a power-of-2 number of bits per row (typically 8192). Chapter 2
describes the details of DRAM behavior that GPUs try to match.
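As a rough illustration of this geometry, a memory controller must split each physical address into bank, row, and column coordinates. The sketch below assumes the figures just quoted (8 banks, 16,384 rows, 8192 bits per row) and one common bit layout (column bits low, then bank, then row, so that consecutive rows interleave across banks); real controllers choose their own mappings.

```python
# Toy address decomposition for a GDRAM with the geometry described in
# the text: 8 banks, 16,384 rows, 8192 bits (1024 bytes) per row.
# The column/bank/row bit ordering is an assumed, common layout.

BANKS = 8               # 4 to 8 banks are typical
ROW_BYTES = 8192 // 8   # 8192 bits per row = 1024 bytes
ROWS = 16_384

def decompose(addr):
    """Return (row, bank, column byte) for a byte address."""
    column = addr % ROW_BYTES
    bank = (addr // ROW_BYTES) % BANKS           # interleave rows across banks
    row = (addr // (ROW_BYTES * BANKS)) % ROWS
    return row, bank, column

# With this layout, consecutive 1024-byte blocks land in different banks,
# so a streaming access keeps several banks busy at once.
print(decompose(0))      # (0, 0, 0)
print(decompose(1024))   # (0, 1, 0)  -- next block, next bank
print(decompose(8192))   # (1, 0, 0)  -- wraps back to bank 0, next row
```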
Given all the potential demands on the GDRAMs from both the computation tasks and the
graphics acceleration tasks, the memory system could see a large number of uncorrelated re-
quests. Alas, this diversity hurts memory performance. To cope, the GPU's memory controller
maintains separate queues of traffic bound for different GDRAM banks, waiting until there is
enough traffic to justify opening a row and transferring all requested data at once. This delay
improves bandwidth but stretches latency, and the controller must ensure that no processing units starve while waiting for data; otherwise neighboring processors could become idle. Section 4.7 shows that gather-scatter techniques and memory-bank-aware access techniques can deliver substantial increases in performance versus conventional cache-based architectures.
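The benefit of those per-bank queues can be seen with a toy cost model (an illustration, not NVIDIA's actual controller). Servicing uncorrelated requests in arrival order reopens rows constantly, while grouping requests by bank and row before draining a queue amortizes one row activation over many column transfers. The timing constants below are hypothetical:

```python
# Toy model of per-bank request queueing. Assumed (not vendor) timings:
ROW_ACTIVATE = 40   # cycles to open a row
COL_ACCESS = 4      # cycles per column transfer

def service_cost(requests):
    """Cycles to service (bank, row) requests in the given order,
    assuming each bank can keep only one row open at a time."""
    open_row = {}   # bank -> currently open row
    cycles = 0
    for bank, row in requests:
        if open_row.get(bank) != row:
            cycles += ROW_ACTIVATE   # must open a new row in this bank
            open_row[bank] = row
        cycles += COL_ACCESS
    return cycles

# Uncorrelated traffic: two banks, two rows each, interleaved arrivals,
# so every request in arrival order hits a closed row.
arrival_order = [(0, 0), (0, 7), (1, 3), (1, 9)] * 8
batched = sorted(arrival_order)   # group by bank, then by row

print(service_cost(arrival_order), service_cost(batched))  # 1408 288
```

With these assumed timings, batching cuts the cost by almost 5x: 4 row activations instead of 32, at the price of the queueing delay the text describes.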
Strided Accesses And TLB Misses
One problem with strided accesses is how they interact with the translation lookaside buffer
(TLB) for virtual memory in vector architectures or GPUs. (GPUs use TLBs for memory map-
ping.) Depending on how the TLB is organized and the size of the array being accessed in
memory, it is even possible to get one TLB miss for every access to an element in the array!
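A small simulation makes the pathology concrete. The sketch below models a fully associative TLB with LRU replacement; the page size and TLB capacity are illustrative assumptions, not figures from any specific GPU. A stride equal to the page size puts every element on a different page, and once the array touches more pages than the TLB holds, an LRU sweep misses on every single access, even on repeat passes.

```python
# Illustrative TLB model: fully associative, LRU replacement.
# PAGE and TLB_ENTRIES are assumed values, not from a real GPU.
from collections import OrderedDict

PAGE = 4096        # assumed page size in bytes
TLB_ENTRIES = 32   # assumed TLB capacity

def misses(addresses):
    """Count TLB misses for a sequence of byte addresses."""
    tlb = OrderedDict()   # virtual page number, kept in LRU order
    count = 0
    for addr in addresses:
        vpn = addr // PAGE
        if vpn in tlb:
            tlb.move_to_end(vpn)         # hit: refresh LRU position
        else:
            count += 1                   # miss: install the translation
            tlb[vpn] = True
            if len(tlb) > TLB_ENTRIES:
                tlb.popitem(last=False)  # evict least recently used
    return count

# 64 elements, stride = page size, two passes: 64 pages > 32 entries,
# so LRU thrashes and all 128 accesses miss.
strided = [i * PAGE for i in range(64)] * 2
# Same 64 elements at unit-like stride (8 bytes) fit in one page: 1 miss.
unit = [i * 8 for i in range(64)] * 2
print(misses(strided), misses(unit))   # 128 1
```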
4.7 Putting It All Together: Mobile versus Server GPUs
and Tesla versus Core i7
Given the popularity of graphics applications, GPUs are now found both in mobile clients and in traditional servers or heavy-duty desktop computers. Figure 4.26 lists the key characteristics of the NVIDIA Tegra 2 for mobile clients, which is used in the LG Optimus 2X and runs Android OS, and the Fermi GPU for servers. GPU server engineers hope to be able to do live animation within five years of a movie's release. GPU mobile engineers in turn hope that, within five more years, a mobile client will be able to do what a server or game console does today.
More concretely, the overarching goal is for the graphics quality of a movie such as Avatar to
be achieved in real time on a server GPU in 2015 and on your mobile GPU in 2020.