Hardware Reference
In-Depth Information
slots on a board, as is the case for system memory. DIMM modules allow for much greater
capacity and for the system to be upgraded, unlike GDRAM. This limited capacity—about 4
GB in 2011—is in conflict with the goal of running bigger problems, which is a natural use of
the increased computational power of GPUs.
To deliver the best possible performance, GPUs try to take into account all the features of
GDRAMs. They are typically arranged internally as 4 to 8 banks, with a power-of-2 number
of rows (typically 16,384) and a power-of-2 number of bits per row (typically 8192). Chapter 2
describes the details of DRAM behavior that GPUs try to match.
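As a rough illustration of this geometry, a memory controller must split each physical address into bank, row, and column coordinates. The sketch below assumes the figures just quoted (8 banks, 16,384 rows, 8192 bits per row) and one common bit layout (column bits low, then bank, then row, so that consecutive rows interleave across banks); real controllers choose their own mappings.

```python
# Toy address decomposition for a GDRAM with the geometry described in
# the text: 8 banks, 16,384 rows, 8192 bits (1024 bytes) per row.
# The column/bank/row bit ordering is an assumed, common layout.

BANKS = 8               # 4 to 8 banks are typical
ROW_BYTES = 8192 // 8   # 8192 bits per row = 1024 bytes
ROWS = 16_384

def decompose(addr):
    """Return (row, bank, column byte) for a byte address."""
    column = addr % ROW_BYTES
    bank = (addr // ROW_BYTES) % BANKS           # interleave rows across banks
    row = (addr // (ROW_BYTES * BANKS)) % ROWS
    return row, bank, column

# With this layout, consecutive 1024-byte blocks land in different banks,
# so a streaming access keeps several banks busy at once.
print(decompose(0))      # (0, 0, 0)
print(decompose(1024))   # (0, 1, 0)  -- next block, next bank
print(decompose(8192))   # (1, 0, 0)  -- wraps back to bank 0, next row
```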
Given all the potential demands on the GDRAMs from both the computation tasks and the
graphics acceleration tasks, the memory system could see a large number of uncorrelated re-
quests. Alas, this diversity hurts memory performance. To cope, the GPU's memory controller
maintains separate queues of traffic bound for different GDRAM banks, waiting until there is
enough traffic to justify opening a row and transferring all requested data at once. This delay
improves bandwidth but stretches latency, and the controller must ensure that no processing units starve while waiting for data; otherwise neighboring processors could become idle. Section 4.7 shows that gather-scatter techniques and memory-bank-aware access techniques can deliver substantial increases in performance versus conventional cache-based architectures.
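The benefit of those per-bank queues can be seen with a toy cost model (an illustration, not NVIDIA's actual controller). Servicing uncorrelated requests in arrival order reopens rows constantly, while grouping requests by bank and row before draining a queue amortizes one row activation over many column transfers. The timing constants below are hypothetical:

```python
# Toy model of per-bank request queueing. Assumed (not vendor) timings:
ROW_ACTIVATE = 40   # cycles to open a row
COL_ACCESS = 4      # cycles per column transfer

def service_cost(requests):
    """Cycles to service (bank, row) requests in the given order,
    assuming each bank can keep only one row open at a time."""
    open_row = {}   # bank -> currently open row
    cycles = 0
    for bank, row in requests:
        if open_row.get(bank) != row:
            cycles += ROW_ACTIVATE   # must open a new row in this bank
            open_row[bank] = row
        cycles += COL_ACCESS
    return cycles

# Uncorrelated traffic: two banks, two rows each, interleaved arrivals,
# so every request in arrival order hits a closed row.
arrival_order = [(0, 0), (0, 7), (1, 3), (1, 9)] * 8
batched = sorted(arrival_order)   # group by bank, then by row

print(service_cost(arrival_order), service_cost(batched))  # 1408 288
```

With these assumed timings, batching cuts the cost by almost 5x: 4 row activations instead of 32, at the price of the queueing delay the text describes.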
Strided Accesses And TLB Misses
One problem with strided accesses is how they interact with the translation lookaside buffer
(TLB) for virtual memory in vector architectures or GPUs. (GPUs use TLBs for memory map-
ping.) Depending on how the TLB is organized and the size of the array being accessed in
memory, it is even possible to get one TLB miss for every access to an element in the array!
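A small simulation makes the pathology concrete. The sketch below models a fully associative TLB with LRU replacement; the page size and TLB capacity are illustrative assumptions, not figures from any specific GPU. A stride equal to the page size puts every element on a different page, and once the array touches more pages than the TLB holds, an LRU sweep misses on every single access, even on repeat passes.

```python
# Illustrative TLB model: fully associative, LRU replacement.
# PAGE and TLB_ENTRIES are assumed values, not from a real GPU.
from collections import OrderedDict

PAGE = 4096        # assumed page size in bytes
TLB_ENTRIES = 32   # assumed TLB capacity

def misses(addresses):
    """Count TLB misses for a sequence of byte addresses."""
    tlb = OrderedDict()   # virtual page number, kept in LRU order
    count = 0
    for addr in addresses:
        vpn = addr // PAGE
        if vpn in tlb:
            tlb.move_to_end(vpn)         # hit: refresh LRU position
        else:
            count += 1                   # miss: install the translation
            tlb[vpn] = True
            if len(tlb) > TLB_ENTRIES:
                tlb.popitem(last=False)  # evict least recently used
    return count

# 64 elements, stride = page size, two passes: 64 pages > 32 entries,
# so LRU thrashes and all 128 accesses miss.
strided = [i * PAGE for i in range(64)] * 2
# Same 64 elements at unit-like stride (8 bytes) fit in one page: 1 miss.
unit = [i * 8 for i in range(64)] * 2
print(misses(strided), misses(unit))   # 128 1
```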
4.7 Putting It All Together: Mobile versus Server GPUs
and Tesla versus Core i7
Given the popularity of graphics applications, GPUs are now found both in mobile clients and in traditional servers or heavy-duty desktop computers. Figure 4.26 lists the key characteristics of the NVIDIA Tegra 2 for mobile clients, which is used in the LG Optimus 2X and runs Android OS, and the Fermi GPU for servers. GPU server engineers hope to be able to do live animation within five years of a movie's release. GPU mobile engineers in turn hope that, within five more years, a mobile client will be able to do what a server or game console does today.
More concretely, the overarching goal is for the graphics quality of a movie such as Avatar to
be achieved in real time on a server GPU in 2015 and on your mobile GPU in 2020.