Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach


if (y > 0 && x > 0) {
    material = IDx[index];
    dH1 = (Hz[index] - Hz[index-incrementY])/dy[y];
    dH2 = (Hy[index] - Hy[index-incrementZ])/dz[z];
    Ex[index] = Ca[material]*Ex[index] + Cb[material]*(dH2 - dH1);
}}}}

Assume that dH1, dH2, Hy, Hz, dy, dz, Ca, Cb, and Ex are all single-precision floating-point arrays.

Assume IDx is an array of unsigned int.

a. [10] <4.3> What is the arithmetic intensity of this kernel?

b. [10] <4.3> Is this kernel amenable to vector or SIMD execution? Why or why not?

c. [10] <4.3> Assume this kernel is to be executed on a processor that has 30 GB/sec of memory bandwidth. Will this kernel be memory bound or compute bound?

d. [10] <4.3> Develop a roofline model for this processor, assuming it has a peak computational throughput of 85 GFLOP/sec.
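For parts (c) and (d), the roofline relation can be sketched in a few lines of C. This is a minimal illustration, not a worked solution: the function names `roofline_gflops` and `ridge_point` are hypothetical, and the arithmetic intensity is left as an input because its value depends on how you count FLOPs and bytes in part (a).

```c
#include <assert.h>

/* Attainable throughput (GFLOP/sec) under the roofline model:
 * the lower of the compute roof and the bandwidth-limited slope.
 * intensity is in FLOP/byte, bw_gbytes in GB/sec. */
double roofline_gflops(double peak_gflops, double bw_gbytes, double intensity)
{
    assert(peak_gflops > 0.0 && bw_gbytes > 0.0 && intensity >= 0.0);
    double memory_roof = bw_gbytes * intensity;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Ridge point (FLOP/byte): the intensity at which the two roofs meet.
 * Kernels below it are memory bound; kernels above it are compute bound. */
double ridge_point(double peak_gflops, double bw_gbytes)
{
    assert(peak_gflops > 0.0 && bw_gbytes > 0.0);
    return peak_gflops / bw_gbytes;
}
```

With the figures given in the exercise, `ridge_point(85.0, 30.0)` is about 2.83 FLOP/byte, so any kernel whose arithmetic intensity falls below that value would sit on the sloped, bandwidth-limited part of the roofline.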

4.13 [10/15] <4.4> Assume a GPU architecture that contains 10 SIMD processors. Each SIMD instruction has a width of 32 and each SIMD processor contains 8 lanes for single-precision arithmetic and load/store instructions, meaning that each non-diverged SIMD instruction can produce 32 results every 4 cycles. Assume a kernel that has divergent branches that cause on average 80% of threads to be active. Assume that 70% of all SIMD instructions executed are single-precision arithmetic and 20% are load/store. Since not all memory latencies are covered, assume an average SIMD instruction issue rate of 0.85. Assume that the GPU has a clock speed of 1.5 GHz.

a. [10] <4.4> Compute the throughput, in GFLOP/sec, for this kernel on this GPU.

b. [15] <4.4> Assume that you have the following choices:

(1) Increasing the number of single-precision lanes to 16

(2) Increasing the number of SIMD processors to 15 (assume this change doesn't affect any other performance metrics and that the code scales to the additional processors)

(3) Adding a cache that will effectively reduce memory latency by 40%, which will increase instruction issue rate to 0.95

What is the speedup in throughput for each of these improvements?
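One way to model exercise 4.13 is as a product of the stated factors. The sketch below assumes that each active arithmetic lane delivers one FLOP per cycle (no fused multiply-add) and that a SIMD processor retires `lanes` results per cycle, which matches "32 results every 4 cycles" for 8 lanes. The struct `gpu_config` and function `gpu_gflops` are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Hypothetical bundle of the parameters stated in the exercise. */
struct gpu_config {
    int    simd_processors;  /* number of SIMD processors                */
    int    lanes;            /* single-precision lanes per processor     */
    double clock_ghz;        /* clock rate in GHz                        */
    double active_fraction;  /* average fraction of threads active       */
    double arith_fraction;   /* fraction of instructions that are FP     */
    double issue_rate;       /* average SIMD instruction issue rate      */
};

/* Estimated throughput in GFLOP/sec, assuming one FLOP per active
 * arithmetic lane per cycle. */
double gpu_gflops(const struct gpu_config *g)
{
    assert(g->simd_processors > 0 && g->lanes > 0 && g->clock_ghz > 0.0);
    return g->clock_ghz * g->simd_processors * g->lanes
         * g->issue_rate * g->arith_fraction * g->active_fraction;
}
```

Under these assumptions the baseline `{10, 8, 1.5, 0.80, 0.70, 0.85}` evaluates to roughly 57.1 GFLOP/sec. The speedup of each option in part (b) is then the ratio of `gpu_gflops` for the modified configuration to the baseline; because the model is a plain product, each ratio reduces to the ratio of the one parameter that changed.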

4.14 [10/15/15] <4.5> In this exercise, we will examine several loops and analyze their potential for parallelization.

a. [10] <4.5> Does the following loop have a loop-carried dependency?

for (i=0;i<100;i++) {

A[i] = B[2*i+4];

B[4*i+5] = A[i];

}
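A question like part (a) can be checked mechanically with the GCD test from Section 4.5, applied to the write B[4*i+5] and the read B[2*i+4]. The following is a rough sketch under the assumption that the subscripts are affine in the loop index; the helper names (`gcd_int`, `gcd_test_may_depend`, `brute_force_depends`) are hypothetical. The GCD test can only rule a dependence out, so a brute-force scan over the actual iteration range is included as a cross-check.

```c
#include <assert.h>
#include <stdlib.h>

/* Greatest common divisor of |a| and |b| (Euclid's algorithm). */
static int gcd_int(int a, int b)
{
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD test for a write to X[a*i + b] and a read of X[c*j + d].
 * Returns 0 if a dependence is ruled out, 1 if one MAY exist. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd_int(a, c);
    if (g == 0)
        return b == d;          /* both subscripts are constants */
    return (d - b) % g == 0;
}

/* Exhaustive check over iterations 0..n-1: returns 1 if some write
 * index a*i + b equals some read index c*j + d, else 0. */
int brute_force_depends(int a, int b, int c, int d, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (a * i + b == c * j + d)
                return 1;
    return 0;
}
```

For the loop above, `gcd_test_may_depend(4, 5, 2, 4)` returns 0: the GCD of 4 and 2 is 2, which does not divide 4 - 5 = -1. Put differently, the written indices are always odd and the read indices always even, so they never coincide. The use of A[i] in the second statement reads the value written in the same iteration, so it is not loop-carried.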

b. [15] <4.5> In the following loop, find all the true dependences, output dependences, and antidependences. Eliminate the output dependences and antidependences by renaming.

for (i=0;i<100;i++) {

A[i] = A[i] * B[i]; /* S1 */

B[i] = A[i] + c; /* S2 */
