SUBVV.D V1,V1,V2 ;subtract under vector mask
SV V1,Rx ;store the result in X
Compiler writers call the transformation that changes an IF statement into a straight-line code sequence using conditional execution if conversion.
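As a sketch of what if conversion means at the source level (the loop and the array names X and Y here are assumptions, not the text's example), a guarded update inside a loop becomes a straight-line body whose result is committed under a per-element predicate, much as the masked vector sequence above does:

/* Branchy form: the subtraction is guarded by an IF. */
void guarded_update(double X[64], double Y[64]) {
    for (int i = 0; i < 64; i++)
        if (X[i] != 0.0)
            X[i] = X[i] - Y[i];
}

/* After if conversion: a straight-line body in which a per-element
   predicate (the software analog of a mask bit) decides whether the
   unconditionally computed result is committed. */
void if_converted_update(double X[64], double Y[64]) {
    for (int i = 0; i < 64; i++) {
        int keep    = (X[i] != 0.0);   /* analogous to setting a mask bit   */
        double diff = X[i] - Y[i];     /* executed whether or not keep == 1 */
        X[i] = keep ? diff : X[i];     /* committed only where keep == 1    */
    }
}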
Using a vector-mask register does have overhead, however. With scalar architectures, conditionally executed instructions still require execution time when the condition is not satisfied. Nonetheless, the elimination of a branch and the associated control dependences can make a conditional instruction faster even if it sometimes does useless work. Similarly, vector instructions executed with a vector mask take the same execution time even for the elements where the mask is zero. Even so, vector-mask control may still be considerably faster than scalar mode, even when the mask contains a significant number of zeros.
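To make the timing point concrete, here is a minimal sketch assuming a hypothetical single-lane vector unit that handles one element per clock: every element position costs a cycle whether its mask bit is 0 or 1, so the cycle count depends only on the vector length.

#define MVL 64   /* assumed maximum vector length, as in VMIPS */

/* Hypothetical model of SUBVV.D V1,V1,V2 executed under a vector mask on a
   single-lane unit: one element per clock, with results written only where
   the mask bit is 1.  The cycle count is independent of how many bits are
   set.  vlr is the current vector length and must be <= MVL. */
int masked_subvv(double v1[MVL], const double v2[MVL],
                 const int vm[MVL], int vlr) {
    int cycles = 0;
    for (int i = 0; i < vlr; i++) {
        double result = v1[i] - v2[i];  /* the lane is busy either way     */
        if (vm[i])
            v1[i] = result;             /* commit only where the mask is 1 */
        cycles++;                       /* a cycle is spent regardless     */
    }
    return cycles;                      /* == vlr, whatever the mask holds */
}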
As we shall see in Section 4.4, one difference between vector processors and GPUs is the way they handle conditional statements. Vector processors make the mask registers part of the architectural state and rely on compilers to manipulate mask registers explicitly. In contrast, GPUs get the same effect using hardware to manipulate internal mask registers that are invisible to GPU software. In both cases, the hardware spends the time to execute a vector element whether the mask is zero or one, so the GFLOPS rate drops when masks are used.
Memory Banks: Supplying Bandwidth for Vector Load/Store Units
The behavior of the load/store vector unit is significantly more complicated than that of the arithmetic functional units. The start-up time for a load is the time to get the first word from memory into a register. If the rest of the vector can be supplied without stalling, then the vector initiation rate is equal to the rate at which new words are fetched or stored. Unlike simpler functional units, the initiation rate may not necessarily be one clock cycle, because memory bank stalls can reduce effective throughput.
Typically, penalties for start-ups on load/store units are higher than those for arithmetic
units—over 100 clock cycles on many processors. For VMIPS we assume a start-up time of 12
clock cycles, the same as the Cray-1. (More recent vector computers use caches to bring down
latency of vector loads and stores.)
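As a quick sanity check on these numbers (assuming the 12-cycle start-up above and an initiation rate of one word per clock with no bank stalls), the total time for an n-element vector load is roughly the start-up plus one clock per element:

/* Rough model: cycles for a vector load, assuming the 12-cycle VMIPS
   start-up stated above and one element per clock once the pipeline is
   full.  Bank stalls, which would stretch the initiation rate, are ignored. */
int vector_load_cycles(int n_elements, int startup_cycles) {
    return startup_cycles + n_elements;
}

/* e.g. vector_load_cycles(64, 12) == 76 cycles for a full 64-element load */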
To maintain an initiation rate of one word fetched or stored per clock, the memory system must be capable of producing or accepting this much data. Spreading accesses across multiple independent memory banks usually delivers the desired rate. As we will soon see, having significant numbers of banks is useful for dealing with vector loads or stores that access rows or columns of data.
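A minimal sketch of why spreading accesses across banks sustains the rate, under assumed numbers (8 word-interleaved banks, each busy for 6 processor clocks per access); none of these parameters come from the text:

#define BANK_BUSY_CLOCKS 6   /* assumed bank cycle time in processor clocks */
#define NUM_BANKS        8   /* assumed bank count; >= BANK_BUSY_CLOCKS     */

/* Toy model: issue one word-sized request per processor clock to bank
   (addr % NUM_BANKS).  With at least BANK_BUSY_CLOCKS banks, a unit-stride
   stream never finds its bank still busy, so one word per clock is
   sustained even though each individual bank is slow. */
int stream_cycles(long start_addr, int n_words) {
    long busy_until[NUM_BANKS] = {0};   /* clock at which each bank frees up */
    long clock = 0;
    for (int i = 0; i < n_words; i++) {
        int b = (int)((start_addr + i) % NUM_BANKS);
        if (busy_until[b] > clock)
            clock = busy_until[b];      /* stall until the bank recovers   */
        busy_until[b] = clock + BANK_BUSY_CLOCKS;
        clock++;                        /* one request issued this clock   */
    }
    return (int)clock;                  /* == n_words when no stalls occur */
}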
Most vector processors use memory banks, which allow multiple independent accesses
rather than simple memory interleaving for three reasons:
1. Many vector computers support multiple loads or stores per clock, and the memory bank cycle time is usually several times larger than the processor cycle time. To support simultaneous accesses from multiple loads or stores, the memory system needs multiple banks and must be able to control the addresses to the banks independently.
2. Most vector processors support the ability to load or store data words that are not sequential. In such cases, independent bank addressing, rather than interleaving, is required.
3. Most vector computers support multiple processors sharing the same memory system, so
each processor will be generating its own independent stream of addresses.
In combination, these features lead to a large number of independent memory banks, as the
following example shows.
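As a hedged stand-in for such an example (all parameters below are invented, not the text's), the usual sizing rule multiplies the number of processors by the accesses each issues per processor clock and by the bank busy time measured in processor clocks:

#include <math.h>

/* Hypothetical sizing rule: enough banks that, in steady state, no request
   finds its bank still busy.  All parameters here are invented for
   illustration and are not the text's example. */
int min_banks(int num_processors, int accesses_per_clock,
              double bank_busy_ns, double processor_clock_ns) {
    int busy_clocks = (int)ceil(bank_busy_ns / processor_clock_ns);
    return num_processors * accesses_per_clock * busy_clocks;
}

/* e.g. min_banks(32, 6, 15.0, 2.5) == 32 * 6 * 6 == 1152 banks */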