Example
The largest configuration of a Cray T90 (Cray T932) has 32 processors, each capable of generating 4 loads and 2 stores per clock cycle. The processor clock cycle is 2.167 ns, while the cycle time of the SRAMs used in the memory system is 15 ns. Calculate the minimum number of memory banks required to allow all processors to run at full memory bandwidth.
Answer
The maximum number of memory references each cycle is 192: 32 processors times 6 references per processor. Each SRAM bank is busy for 15/2.167 = 6.92 clock cycles, which we round up to 7 processor clock cycles. Therefore, we require a minimum of 192 × 7 = 1344 memory banks!
The Cray T932 actually has 1024 memory banks, so the early models could not sustain full bandwidth to all processors simultaneously. A subsequent memory upgrade replaced the 15 ns asynchronous SRAMs with pipelined synchronous SRAMs that more than halved the memory cycle time, thereby providing sufficient bandwidth.
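As a minimal sketch (not part of the original text), the arithmetic above can be written out in C; the variable names are illustrative and the values are taken from the example.

    #include <math.h>
    #include <stdio.h>

    /* Illustrative sketch: minimum banks so every processor can issue its
       memory references each clock. Values match the Cray T932 example. */
    int main(void) {
        int processors     = 32;     /* Cray T932 processors             */
        int refs_per_clock = 6;      /* 4 loads + 2 stores per processor */
        double clock_ns    = 2.167;  /* processor clock cycle            */
        double sram_ns     = 15.0;   /* SRAM cycle time                  */

        /* A bank is busy for ceil(15 / 2.167) = 7 processor clocks. */
        int busy_clocks = (int)ceil(sram_ns / clock_ns);

        /* 192 references per clock x 7 clocks of busy time = 1344 banks. */
        int min_banks = processors * refs_per_clock * busy_clocks;

        printf("minimum banks = %d\n", min_banks);   /* prints 1344 */
        return 0;
    }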
Taking a higher-level perspective, vector load/store units play a similar role to prefetch units in scalar processors in that both try to deliver data bandwidth by supplying processors with streams of data.
Stride: Handling Multidimensional Arrays in Vector Architectures
The position in memory of adjacent elements in a vector may not be sequential. Consider this
straightforward code for matrix multiply in C:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
We could vectorize the multiplication of each row of B with each column of D and strip-mine
the inner loop with k as the index variable.
To do so, we must consider how to address adjacent elements in B and adjacent elements in D. When an array is allocated memory, it is linearized and must be laid out in either row-major (as in C) or column-major (as in Fortran) order. This linearization means that either the elements in the row or the elements in the column are not adjacent in memory. For example, the C code above allocates in row-major order, so the elements of D that are accessed by iterations in the inner loop are separated by the row size times 8 (the number of bytes per entry) for a total of 800 bytes. In Chapter 2, we saw that blocking could improve locality in cache-based systems. For vector processors without caches, we need another technique to fetch elements of a vector that are not adjacent in memory.
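The distance separating the elements to be gathered into a single register is called the stride. The sketch below is an assumed illustration, not code from the text: it emulates a strided load of one column of D from the row-major array in the example above, stepping 100 elements (800 bytes) between accesses, which is what a vector load with a non-unit stride would do in hardware.

    #include <stdio.h>

    #define N 100

    /* Illustrative sketch: gather column j of a row-major matrix into a
       contiguous buffer, as a strided vector load would. Successive
       elements of the column are N doubles (800 bytes) apart in memory. */
    static void gather_column(double D[N][N], int j, double vreg[N]) {
        double *base = &D[0][j];          /* first element of column j   */
        for (int k = 0; k < N; k++)
            vreg[k] = base[k * N];        /* stride of N elements        */
    }

    int main(void) {
        static double D[N][N];
        double vreg[N];

        for (int k = 0; k < N; k++)
            D[k][5] = (double)k;          /* fill column 5 for the demo  */

        gather_column(D, 5, vreg);
        printf("vreg[99] = %g\n", vreg[99]);   /* prints 99 */
        return 0;
    }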