if (y > 0 && x > 0) {
    material = IDx[index];
    dH1 = (Hz[index] - Hz[index-incrementY])/dy[y];
    dH2 = (Hy[index] - Hy[index-incrementZ])/dz[z];
    Ex[index] = Ca[material]*Ex[index] + Cb[material]*(dH2 - dH1);
}}}}
Assume that dH1, dH2, Hy, Hz, dy, dz, Ca, Cb, and Ex are all single-precision floating-point arrays.
Assume IDx is an array of unsigned int.
a. [10] <4.3> What is the arithmetic intensity of this kernel?
b. [10] <4.3> Is this kernel amenable to vector or SIMD execution? Why or why not?
c. [10] <4.3> Assume this kernel is to be executed on a processor that has 30 GB/sec of
memory bandwidth. Will this kernel be memory bound or compute bound?
d. [10] <4.3> Develop a roofline model for this processor, assuming it has a peak
computational throughput of 85 GFLOP/sec.
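Parts c and d both rest on the roofline relation: attainable throughput is the smaller of the peak compute rate and the memory bandwidth multiplied by the arithmetic intensity. The sketch below is one way to express that relation in C. The function names are illustrative and not part of the exercise, and the arithmetic intensity is left as a parameter because it is the answer to part a.

```c
/* Minimal roofline helpers (hypothetical names, not from the exercise).
 * peak_gflops   : peak compute rate in GFLOP/sec (85 in the exercise)
 * bandwidth_gbs : memory bandwidth in GB/sec (30 in the exercise)
 * intensity     : arithmetic intensity in FLOP/byte (from part a) */

/* Attainable throughput: min(peak, bandwidth * intensity). */
double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                       double intensity)
{
    double memory_roof = bandwidth_gbs * intensity;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Ridge point: the intensity at which the two roofs meet. */
double ridge_point(double peak_gflops, double bandwidth_gbs)
{
    return peak_gflops / bandwidth_gbs;
}

/* Returns 1 if a kernel with this intensity sits under the sloped
 * (memory) part of the roofline, 0 if it sits under the flat part. */
int is_memory_bound(double peak_gflops, double bandwidth_gbs,
                    double intensity)
{
    return intensity < ridge_point(peak_gflops, bandwidth_gbs);
}
```

With the numbers from the exercise, ridge_point(85.0, 30.0) is about 2.83 FLOP/byte, so any kernel whose intensity falls below that value would be memory bound on this processor.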
4.13 [10/15] <4.4> Assume a GPU architecture that contains 10 SIMD processors. Each SIMD
instruction has a width of 32 and each SIMD processor contains 8 lanes for single-precision
arithmetic and load/store instructions, meaning that each non-diverged SIMD instruction
can produce 32 results every 4 cycles. Assume a kernel that has divergent branches that
cause on average 80% of threads to be active. Assume that 70% of all SIMD instructions
executed are single-precision arithmetic and 20% are load/store. Since not all memory
latencies are covered, assume an average SIMD instruction issue rate of 0.85. Assume that the
GPU has a clock speed of 1.5 GHz.
a. [10] <4.4> Compute the throughput, in GFLOP/sec, for this kernel on this GPU.
b. [15] <4.4> Assume that you have the following choices:
(1) Increasing the number of single-precision lanes to 16
(2) Increasing the number of SIMD processors to 15 (assume this change doesn't
affect any other performance metrics and that the code scales to the additional
processors)
(3) Adding a cache that will effectively reduce memory latency by 40%, which will
increase the instruction issue rate to 0.95
What is the speedup in throughput for each of these improvements?
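One way to approach both parts is to treat throughput as a product of the stated factors: clock rate, number of SIMD processors, lanes per processor, fraction of active threads, fraction of instructions that are single-precision arithmetic, and issue rate. The sketch below encodes that reading in C. The struct and function names are hypothetical, and the model assumes each active lane completes one floating-point result per cycle (32 results every 4 cycles with 8 lanes).

```c
/* Hypothetical parameter bundle; field names are illustrative only. */
struct gpu_config {
    double clock_ghz;        /* clock speed in GHz */
    int    simd_processors;  /* number of SIMD processors */
    int    lanes;            /* single-precision lanes per processor */
    double active_fraction;  /* average fraction of threads active */
    double fp_fraction;      /* fraction of instructions that are SP arithmetic */
    double issue_rate;       /* average SIMD instruction issue rate */
};

/* Values taken from the exercise statement. */
struct gpu_config baseline_config(void)
{
    struct gpu_config g = { 1.5, 10, 8, 0.80, 0.70, 0.85 };
    return g;
}

/* Throughput in GFLOP/sec, assuming one result per active lane per cycle. */
double gpu_gflops(struct gpu_config g)
{
    return g.clock_ghz * g.simd_processors * g.lanes *
           g.active_fraction * g.fp_fraction * g.issue_rate;
}

/* Speedup of a modified configuration relative to a base configuration. */
double speedup(struct gpu_config base, struct gpu_config changed)
{
    return gpu_gflops(changed) / gpu_gflops(base);
}
```

Under these assumptions the baseline works out to about 57.12 GFLOP/sec. Each choice in part b changes a single field (lanes, simd_processors, or issue_rate), so its speedup under this model is simply the ratio of the new value to the old one.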
4.14 [10/15/15] <4.5> In this exercise, we will examine several loops and analyze their
potential for parallelization.
a. [10] <4.5> Does the following loop have a loop-carried dependency?
for (i=0;i<100;i++) {
A[i] = B[2*i+4];
B[4*i+5] = A[i];
}
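A common tool for questions like part a is the GCD test: if a loop stores to X[a*i+b] and loads from X[c*i+d], a loop-carried dependence is possible only when GCD(a, c) divides (d - b). The sketch below implements that test in C, along with a brute-force check over a fixed iteration count for comparison. The function names are hypothetical. For the loop above, the store is B[4*i+5] and the load is B[2*i+4], so a = 4, b = 5, c = 2, d = 4. The accesses to A[i] occur within the same iteration and are therefore not loop-carried.

```c
#include <stdlib.h>  /* abs */

/* Greatest common divisor of |x| and |y| (Euclid's algorithm). */
static int gcd(int x, int y)
{
    x = abs(x);
    y = abs(y);
    while (y != 0) {
        int t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* GCD test for a store to X[a*i+b] and a load from X[c*i+d].
 * Returns 0 if a dependence is ruled out, 1 if it remains possible.
 * The test is conservative: it ignores the loop bounds, so a result
 * of 1 does not prove that a dependence actually occurs. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(a, c);
    if (g == 0)              /* both strides are zero: fixed locations */
        return d == b;
    return (d - b) % g == 0;
}

/* Brute-force check: do two different iterations in [0, n) touch the
 * same element, one through the store index and one through the load? */
int brute_force_depends(int a, int b, int c, int d, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j && a * i + b == c * j + d)
                return 1;
    return 0;
}
```

Calling gcd_test_may_depend(4, 5, 2, 4) returns 0, since GCD(4, 2) = 2 does not divide -1. The brute-force check over 100 iterations agrees: the store indices are always odd and the load indices always even.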
b. [15] <4.5> In the following loop, find all the true dependences, output dependences,
and antidependences. Eliminate the output dependences and antidependences by
renaming.
for (i=0;i<100;i++) {
A[i] = A[i] * B[i]; /* S1 */
B[i] = A[i] + c; /* S2 */