SUBVV.D V1,V1,V2 ;subtract under vector mask
SV V1,Rx ;store the result in X
Compiler writers call the transformation that changes an IF statement into a straight-line code sequence using conditional execution if conversion.
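As a sketch of what if conversion means at the source level (the loop and the array names X and Y here are assumptions, not the text's example), a guarded update inside a loop becomes a straight-line body whose result is committed under a per-element predicate, much as the masked vector sequence above does:

/* Branchy form: the subtraction is guarded by an IF. */
void guarded_update(double X[64], double Y[64]) {
    for (int i = 0; i < 64; i++)
        if (X[i] != 0.0)
            X[i] = X[i] - Y[i];
}

/* After if conversion: a straight-line body in which a per-element
   predicate (the software analog of a mask bit) decides whether the
   unconditionally computed result is committed. */
void if_converted_update(double X[64], double Y[64]) {
    for (int i = 0; i < 64; i++) {
        int keep    = (X[i] != 0.0);   /* analogous to setting a mask bit   */
        double diff = X[i] - Y[i];     /* executed whether or not keep == 1 */
        X[i] = keep ? diff : X[i];     /* committed only where keep == 1    */
    }
}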
Using a vector-mask register does have overhead, however. With scalar architectures, conditionally executed instructions still require execution time when the condition is not satisfied. Nonetheless, the elimination of a branch and the associated control dependences can make a conditional instruction faster even if it sometimes does useless work. Similarly, vector instructions executed with a vector mask take the same execution time even for the elements where the mask is zero. Even so, vector-mask control may still be considerably faster than scalar mode, even when the mask contains a significant number of zeros.
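To make the timing point concrete, here is a minimal sketch assuming a hypothetical single-lane vector unit that handles one element per clock: every element position costs a cycle whether its mask bit is 0 or 1, so the cycle count depends only on the vector length.

#define MVL 64   /* assumed maximum vector length, as in VMIPS */

/* Hypothetical model of SUBVV.D V1,V1,V2 executed under a vector mask on a
   single-lane unit: one element per clock, with results written only where
   the mask bit is 1.  The cycle count is independent of how many bits are
   set.  vlr is the current vector length and must be <= MVL. */
int masked_subvv(double v1[MVL], const double v2[MVL],
                 const int vm[MVL], int vlr) {
    int cycles = 0;
    for (int i = 0; i < vlr; i++) {
        double result = v1[i] - v2[i];  /* the lane is busy either way     */
        if (vm[i])
            v1[i] = result;             /* commit only where the mask is 1 */
        cycles++;                       /* a cycle is spent regardless     */
    }
    return cycles;                      /* == vlr, whatever the mask holds */
}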
As we shall see in Section 4.4, one difference between vector processors and GPUs is the way they handle conditional statements. Vector processors make the mask registers part of the architectural state and rely on compilers to manipulate mask registers explicitly. In contrast, GPUs get the same effect using hardware to manipulate internal mask registers that are invisible to GPU software. In both cases, the hardware spends the time to execute a vector element whether the mask is zero or one, so the GFLOPS rate drops when masks are used.
Memory Banks: Supplying Bandwidth for Vector Load/Store Units
The behavior of the load/store vector unit is significantly more complicated than that of the arithmetic functional units. The start-up time for a load is the time to get the first word from memory into a register. If the rest of the vector can be supplied without stalling, then the vector initiation rate is equal to the rate at which new words are fetched or stored. Unlike simpler functional units, the initiation rate may not necessarily be one clock cycle, because memory bank stalls can reduce effective throughput.
Typically, penalties for start-ups on load/store units are higher than those for arithmetic
units—over 100 clock cycles on many processors. For VMIPS we assume a start-up time of 12
clock cycles, the same as the Cray-1. (More recent vector computers use caches to bring down
latency of vector loads and stores.)
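As a quick sanity check on these numbers (assuming the 12-cycle start-up above and an initiation rate of one word per clock with no bank stalls), the total time for an n-element vector load is roughly the start-up plus one clock per element:

/* Rough model: cycles for a vector load, assuming the 12-cycle VMIPS
   start-up stated above and one element per clock once the pipeline is
   full.  Bank stalls, which would stretch the initiation rate, are ignored. */
int vector_load_cycles(int n_elements, int startup_cycles) {
    return startup_cycles + n_elements;
}

/* e.g. vector_load_cycles(64, 12) == 76 cycles for a full 64-element load */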
To maintain an initiation rate of one word fetched or stored per clock, the memory system must be capable of producing or accepting this much data. Spreading accesses across multiple independent memory banks usually delivers the desired rate. As we will soon see, having significant numbers of banks is useful for dealing with vector loads or stores that access rows or columns of data.
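A minimal sketch of why spreading accesses across banks sustains the rate, under assumed numbers (8 word-interleaved banks, each busy for 6 processor clocks per access); none of these parameters come from the text:

#define BANK_BUSY_CLOCKS 6   /* assumed bank cycle time in processor clocks */
#define NUM_BANKS        8   /* assumed bank count; >= BANK_BUSY_CLOCKS     */

/* Toy model: issue one word-sized request per processor clock to bank
   (addr % NUM_BANKS).  With at least BANK_BUSY_CLOCKS banks, a unit-stride
   stream never finds its bank still busy, so one word per clock is
   sustained even though each individual bank is slow. */
int stream_cycles(long start_addr, int n_words) {
    long busy_until[NUM_BANKS] = {0};   /* clock at which each bank frees up */
    long clock = 0;
    for (int i = 0; i < n_words; i++) {
        int b = (int)((start_addr + i) % NUM_BANKS);
        if (busy_until[b] > clock)
            clock = busy_until[b];      /* stall until the bank recovers   */
        busy_until[b] = clock + BANK_BUSY_CLOCKS;
        clock++;                        /* one request issued this clock   */
    }
    return (int)clock;                  /* == n_words when no stalls occur */
}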
Most vector processors use memory banks, which allow multiple independent accesses
rather than simple memory interleaving for three reasons:
1. Many vector computers support multiple loads or stores per clock, and the memory bank cycle time is usually several times larger than the processor cycle time. To support simultaneous accesses from multiple loads or stores, the memory system needs multiple banks and must be able to control the addresses to the banks independently.
2. Most vector processors support the ability to load or store data words that are not sequential. In such cases, independent bank addressing, rather than interleaving, is required.
3. Most vector computers support multiple processors sharing the same memory system, so
each processor will be generating its own independent stream of addresses.
In combination, these features lead to a large number of independent memory banks, as the
following example shows.
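As a hedged stand-in for such an example (all parameters below are invented, not the text's), the usual sizing rule multiplies the number of processors by the accesses each issues per processor clock and by the bank busy time measured in processor clocks:

#include <math.h>

/* Hypothetical sizing rule: enough banks that, in steady state, no request
   finds its bank still busy.  All parameters here are invented for
   illustration and are not the text's example. */
int min_banks(int num_processors, int accesses_per_clock,
              double bank_busy_ns, double processor_clock_ns) {
    int busy_clocks = (int)ceil(bank_busy_ns / processor_clock_ns);
    return num_processors * accesses_per_clock * busy_clocks;
}

/* e.g. min_banks(32, 6, 15.0, 2.5) == 32 * 6 * 6 == 1152 banks */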