for (i=0;i<64;i++) dot[i] = a[i] * b[i];
for (i=1;i<64;i++) dot[0] = dot[0] + dot[i];
As mentioned in Section 4.5, if we allow floating-point addition to be associative, there
are several techniques available for parallelizing the reduction.
a. [15] <4.4, 4.5> One technique is called recurrence doubling, which adds sequences of
progressively shorter vectors (i.e., two 32-element vectors, then two 16-element vectors,
and so on). Show how the C code would look for executing the second loop in this way.
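One possible shape of the answer, given here only as a sketch and assuming the 64-element dot array from the loops above: each pass adds the upper half of the active region into the lower half, halving the active length until the full sum is left in dot[0].
for (int len = 32; len >= 1; len = len / 2) {
    /* add the upper half of the active region into the lower half */
    for (int i = 0; i < len; i++) {
        dot[i] = dot[i] + dot[i + len];
    }
}
/* dot[0] now holds the complete dot product */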
b. [25] <4.4, 4.5> In some vector processors, the individual elements within the vector
registers are addressable. In this case, the operands to a vector operation may be two
different parts of the same vector register. This allows another solution for the reduction
called partial sums. The idea is to reduce the vector to m sums, where m is the
total latency through the vector functional unit, including the operand read and write
times. Assume that the VMIPS vector registers are addressable (e.g., you can initiate
a vector operation with the operand V1(16), indicating that the input operand begins
with element 16). Also assume that the total latency for adds, including the operand
read and result write, is eight cycles. Write a VMIPS code sequence that reduces the
contents of V1 to eight partial sums.
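A hedged sketch of one such sequence follows; it uses the element-addressing syntax defined above and the VMIPS ADDVV.D vector add, and it assumes the vector length is set to eight by loading a scalar register and writing it to the vector-length register (the exact instructions used for that step are an assumption). Seven length-8 vector adds fold each successive 8-element block of V1 into the first block, leaving the eight partial sums in elements 0 through 7.
        DADDIU   R1,R0,#8            ; assumed: R1 <- 8
        MTC1     VLR,R1              ; assumed: set vector length to 8
        ADDVV.D  V1(0),V1(0),V1(8)   ; V1[0..7] += V1[8..15]
        ADDVV.D  V1(0),V1(0),V1(16)  ; V1[0..7] += V1[16..23]
        ADDVV.D  V1(0),V1(0),V1(24)
        ADDVV.D  V1(0),V1(0),V1(32)
        ADDVV.D  V1(0),V1(0),V1(40)
        ADDVV.D  V1(0),V1(0),V1(48)
        ADDVV.D  V1(0),V1(0),V1(56)  ; eight partial sums now in V1(0)..V1(7)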
c. [25] <4.4, 4.5> When performing a reduction on a GPU, one thread is associated with
each element in the input vector. The first step is for each thread to write its corresponding
value into shared memory. Next, each thread enters a loop that adds each
pair of input values. This reduces the number of elements by half after each iteration,
meaning that the number of active threads also reduces by half after each iteration. To
maximize the performance of the reduction, the number of fully populated
warps should be maximized throughout the course of the loop. In other words, the
active threads should be contiguous. Also, each thread should index the shared array
in such a way as to avoid bank conflicts in the shared memory. The following loop
violates only the first of these guidelines and also uses the modulo operator, which is
very expensive for GPUs:
unsigned int tid = threadIdx.x;
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if ((tid % (2*s)) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}
Rewrite the loop to meet these guidelines and eliminate the use of the modulo operator.
Assume that there are 32 threads per warp and that a bank conflict occurs whenever
two or more threads from the same warp reference indices that are equal modulo 32.
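One loop that meets both guidelines, given here as a sketch (the sequential-addressing form of the reduction, not necessarily the only acceptable answer), assumes as the original code does that blockDim.x is a power of two. The active threads are always the contiguous range 0 to s-1, consecutive threads in a warp reference consecutive shared-memory banks, and the modulo test is replaced by a simple comparison:
unsigned int tid = threadIdx.x;
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        // active threads 0..s-1 are contiguous, and consecutive tids
        // reference consecutive banks, so there are no bank conflicts
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}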
4.12 [10/10/10/10] <4.3> The following kernel performs a portion of the finite-difference time-
domain (FDTD) method for computing Maxwell's equations in a three-dimensional space,
part of one of the SPEC06fp benchmarks:
for (int x = 0; x < NX-1; x++) {
    for (int y = 0; y < NY-1; y++) {
        for (int z = 0; z < NZ-1; z++) {
            int index = x*NY*NZ + y*NZ + z;