for (i=0;i<64;i++) dot[i] = a[i] * b[i];
for (i=1;i<64;i++) dot[0] = dot[0] + dot[i];
As mentioned in Section 4.5, if we allow floating-point addition to be associative, there
are several techniques available for parallelizing the reduction.
a. [15] <4.4, 4.5> One technique is called recurrence doubling, which adds sequences of
progressively shorter vectors (i.e., two 32-element vectors, then two 16-element vectors,
and so on). Show how the C code would look for executing the second loop in this way.
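One possible shape of the answer, given here only as a sketch and assuming the 64-element dot array from the loops above: each pass adds the upper half of the active region into the lower half, halving the active length until the full sum is left in dot[0].
for (int len = 32; len >= 1; len = len / 2) {
    /* add the upper half of the active region into the lower half */
    for (int i = 0; i < len; i++) {
        dot[i] = dot[i] + dot[i + len];
    }
}
/* dot[0] now holds the complete dot product */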
b. [25] <4.4, 4.5> In some vector processors, the individual elements within the vector
registers are addressable. In this case, the operands to a vector operation may be two
different parts of the same vector register. This allows another solution for the reduction
called partial sums. The idea is to reduce the vector to m sums, where m is the
total latency through the vector functional unit, including the operand read and write
times. Assume that the VMIPS vector registers are addressable (e.g., you can initiate
a vector operation with the operand V1(16), indicating that the input operand begins
with element 16). Also assume that the total latency for adds, including the operand
read and result write, is eight cycles. Write a VMIPS code sequence that reduces the
contents of V1 to eight partial sums.
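A hedged sketch of one such sequence follows; it uses the element-addressing syntax defined above and the VMIPS ADDVV.D vector add, and it assumes the vector length is set to eight by loading a scalar register and writing it to the vector-length register (the exact instructions used for that step are an assumption). Seven length-8 vector adds fold each successive 8-element block of V1 into the first block, leaving the eight partial sums in elements 0 through 7.
        DADDIU   R1,R0,#8            ; assumed: R1 <- 8
        MTC1     VLR,R1              ; assumed: set vector length to 8
        ADDVV.D  V1(0),V1(0),V1(8)   ; V1[0..7] += V1[8..15]
        ADDVV.D  V1(0),V1(0),V1(16)  ; V1[0..7] += V1[16..23]
        ADDVV.D  V1(0),V1(0),V1(24)
        ADDVV.D  V1(0),V1(0),V1(32)
        ADDVV.D  V1(0),V1(0),V1(40)
        ADDVV.D  V1(0),V1(0),V1(48)
        ADDVV.D  V1(0),V1(0),V1(56)  ; eight partial sums now in V1(0)..V1(7)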
c. [25] <4.4, 4.5> When performing a reduction on a GPU, one thread is associated with
each element in the input vector. The first step is for each thread to write its corresponding
value into shared memory. Next, each thread enters a loop that adds each
pair of input values. This reduces the number of elements by half after each iteration,
meaning that the number of active threads also reduces by half after each iteration. To
maximize the performance of the reduction, the number of fully populated
warps should be maximized throughout the course of the loop. In other words, the
active threads should be contiguous. Also, each thread should index the shared array
in such a way as to avoid bank conflicts in the shared memory. The following loop
violates only the first of these guidelines and also uses the modulo operator, which is
very expensive for GPUs:
unsigned int tid = threadIdx.x;
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if ((tid % (2*s)) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}
Rewrite the loop to meet these guidelines and eliminate the use of the modulo operator.
Assume that there are 32 threads per warp and that a bank conflict occurs whenever
two or more threads from the same warp reference indices that are equal modulo 32.
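One loop that meets both guidelines, given here as a sketch (the sequential-addressing form of the reduction, not necessarily the only acceptable answer), assumes as the original code does that blockDim.x is a power of two. The active threads are always the contiguous range 0 to s-1, consecutive threads in a warp reference consecutive shared-memory banks, and the modulo test is replaced by a simple comparison:
unsigned int tid = threadIdx.x;
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        // active threads 0..s-1 are contiguous, and consecutive tids
        // reference consecutive banks, so there are no bank conflicts
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}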
4.12 [10/10/10/10] <4.3> The following kernel performs a portion of the finite-difference time-
domain (FDTD) method for computing Maxwell's equations in a three-dimensional space,
part of one of the SPEC06fp benchmarks:
for (int x = 0; x < NX-1; x++) {
    for (int y = 0; y < NY-1; y++) {
        for (int z = 0; z < NZ-1; z++) {
            int index = x*NY*NZ + y*NZ + z;