Example
The largest configuration of a Cray T90 (Cray T932) has 32 processors, each capable of generating 4 loads and 2 stores per clock cycle. The processor clock cycle is 2.167 ns, while the cycle time of the SRAMs used in the memory system is 15 ns. Calculate the minimum number of memory banks required to allow all processors to run at full memory bandwidth.
Answer
The maximum number of memory references each cycle is 192: 32 processors times 6 references per processor. Each SRAM bank is busy for 15/2.167 = 6.92 clock cycles, which we round up to 7 processor clock cycles. Therefore, we require a minimum of 192 × 7 = 1344 memory banks!
The Cray T932 actually has 1024 memory banks, so the early models could not sustain full bandwidth to all processors simultaneously. A subsequent memory upgrade replaced the 15 ns asynchronous SRAMs with pipelined synchronous SRAMs that more than halved the memory cycle time, thereby providing sufficient bandwidth.
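As a minimal sketch (not part of the original text), the arithmetic above can be written out in C; the variable names are illustrative and the values are taken from the example.

    #include <math.h>
    #include <stdio.h>

    /* Illustrative sketch: minimum banks so every processor can issue its
       memory references each clock. Values match the Cray T932 example. */
    int main(void) {
        int processors     = 32;     /* Cray T932 processors             */
        int refs_per_clock = 6;      /* 4 loads + 2 stores per processor */
        double clock_ns    = 2.167;  /* processor clock cycle            */
        double sram_ns     = 15.0;   /* SRAM cycle time                  */

        /* A bank is busy for ceil(15 / 2.167) = 7 processor clocks. */
        int busy_clocks = (int)ceil(sram_ns / clock_ns);

        /* 192 references per clock x 7 clocks of busy time = 1344 banks. */
        int min_banks = processors * refs_per_clock * busy_clocks;

        printf("minimum banks = %d\n", min_banks);   /* prints 1344 */
        return 0;
    }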
Taking a higher-level perspective, vector load/store units play a similar role to prefetch units in scalar processors in that both try to deliver data bandwidth by supplying processors with streams of data.
Stride: Handling Multidimensional Arrays in Vector Architectures
The position in memory of adjacent elements in a vector may not be sequential. Consider this
straightforward code for matrix multiply in C:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
We could vectorize the multiplication of each row of B with each column of D and strip-mine
the inner loop with k as the index variable.
To do so, we must consider how to address adjacent elements in B and adjacent elements in D. When an array is allocated memory, it is linearized and must be laid out in either row-major (as in C) or column-major (as in Fortran) order. This linearization means that either the elements in the row or the elements in the column are not adjacent in memory. For example, the C code above allocates in row-major order, so the elements of D that are accessed by iterations in the inner loop are separated by the row size times 8 (the number of bytes per entry) for a total of 800 bytes. In Chapter 2, we saw that blocking could improve locality in cache-based systems. For vector processors without caches, we need another technique to fetch elements of a vector that are not adjacent in memory.
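The distance separating the elements to be gathered into a single register is called the stride. The sketch below is an assumed illustration, not code from the text: it emulates a strided load of one column of D from the row-major array in the example above, stepping 100 elements (800 bytes) between accesses, which is what a vector load with a non-unit stride would do in hardware.

    #include <stdio.h>

    #define N 100

    /* Illustrative sketch: gather column j of a row-major matrix into a
       contiguous buffer, as a strided vector load would. Successive
       elements of the column are N doubles (800 bytes) apart in memory. */
    static void gather_column(double D[N][N], int j, double vreg[N]) {
        double *base = &D[0][j];          /* first element of column j   */
        for (int k = 0; k < N; k++)
            vreg[k] = base[k * N];        /* stride of N elements        */
    }

    int main(void) {
        static double D[N][N];
        double vreg[N];

        for (int k = 0; k < N; k++)
            D[k][5] = (double)k;          /* fill column 5 for the demo  */

        gather_column(D, 5, vreg);
        printf("vreg[99] = %g\n", vreg[99]);   /* prints 99 */
        return 0;
    }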