A[i] = C[i] * c; /* S3 */
C[i] = D[i] * A[i]; /* S4 */
}
c. [15] <4.5> Consider the following loop:
for (i=0; i<100; i++) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
Are there dependences between S1 and S2? Is this loop parallel? If not, show how to
make it parallel.
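For orientation only, this loop closely mirrors the worked example in Section 4.5: S2 in iteration i produces the B[i+1] that S1 reads in iteration i+1, a loop-carried but non-circular dependence. As a sketch of what "make it parallel" can mean here (one standard rewriting, assuming the same arrays A, B, C, and D as in the exercise; not presented as the only correct answer):

/* Sketch: realign S1 and S2 across iterations and peel the first and last
   computations, so the remaining loop body has no loop-carried dependence. */
A[0] = A[0] + B[0];
for (i = 0; i < 99; i++) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];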
4.15 [10] <4.4> List and describe at least four factors that influence the performance of GPU
kernels. In other words, which runtime behaviors caused by the kernel code reduce resource
utilization during kernel execution?
4.16 [10] <4.4> Assume a hypothetical GPU with the following characteristics:
■ Clock rate 1.5 GHz
■ Contains 16 SIMD processors, each containing 16 single-precision floating-point units
■ Has 100 GB/sec off-chip memory bandwidth
Without considering memory bandwidth, what is the peak single-precision floating-point
throughput for this GPU in GFLOP/sec, assuming that all memory latencies can be hidden? Is
this throughput sustainable given the memory bandwidth limitation?
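One way to organize the calculation is sketched below; it assumes each floating-point unit retires one single-precision operation per cycle (an FMA-capable unit would double the figure), which is our assumption rather than a parameter stated in the exercise.

/* Back-of-the-envelope sketch for Exercise 4.16. */
#include <stdio.h>
int main(void) {
    double clock_ghz = 1.5;           /* clock rate                        */
    int simd_processors = 16;         /* SIMD processors                   */
    int fp_units = 16;                /* single-precision FP units each    */
    double peak_gflops = clock_ghz * simd_processors * fp_units;

    double bandwidth_gbs = 100.0;     /* off-chip memory bandwidth         */
    /* FLOPs that must be performed per byte fetched to sustain the peak. */
    double flops_per_byte = peak_gflops / bandwidth_gbs;
    printf("peak = %.0f GFLOP/sec; sustaining it needs %.2f FLOPs/byte\n",
           peak_gflops, flops_per_byte);
    return 0;
}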
4.17 [60] <4.4> For this programming exercise, you will write and characterize the behavior
of a CUDA kernel that contains a high amount of data-level parallelism but also contains
conditional execution behavior. Use the NVIDIA CUDA Toolkit along with GPGPU-Sim from
the University of British Columbia (http://www.ece.ubc.ca/~aamodt/gpgpu-sim/) or the CUDA
Profiler to write and compile a CUDA kernel that performs 100 iterations of Conway's
Game of Life for a 256 × 256 game board and returns the final state of the game board to
the host. Assume that the board is initialized by the host. Associate one thread with each
cell. Make sure you add a barrier after each game iteration. Use the following game rules:
■ Any live cell with fewer than two live neighbors dies.
■ Any live cell with two or three live neighbors lives on to the next generation.
■ Any live cell with more than three live neighbors dies.
■ Any dead cell with exactly three live neighbors becomes a live cell.
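A minimal sketch of how such a kernel might be organized is shown below. It is illustrative only: the names life_step, d_a, and d_b, the 16 × 16 thread-block shape, and the dead-border boundary handling are our assumptions, not requirements of the exercise. One kernel launch per generation supplies the grid-wide barrier between iterations.

/* One thread per cell; cells are stored as 0 (dead) or 1 (live). */
__global__ void life_step(const unsigned char *in, unsigned char *out) {
    const int N = 256;                      /* board dimension from the exercise */
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= N || y >= N) return;
    int live = 0;                           /* count live neighbors              */
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            if (dx == 0 && dy == 0) continue;
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < N && ny >= 0 && ny < N)   /* border treated as dead */
                live += in[ny * N + nx];
        }
    unsigned char cell = in[y * N + x];
    /* Live cell survives with 2 or 3 neighbors; dead cell with exactly 3
       neighbors becomes live; everything else is dead next generation.   */
    out[y * N + x] = cell ? (live == 2 || live == 3) : (live == 3);
}

/* Host loop: one launch per generation; swapping the boards between
   launches provides the barrier the exercise asks for.                   */
void run_game(unsigned char *d_a, unsigned char *d_b) {
    dim3 block(16, 16), grid(256 / 16, 256 / 16);
    for (int gen = 0; gen < 100; gen++) {
        life_step<<<grid, block>>>(d_a, d_b);
        cudaDeviceSynchronize();
        unsigned char *t = d_a; d_a = d_b; d_b = t;
    }
}

In a sketch like this, the boundary tests and the rule evaluation are the conditional sections whose PTX form and runtime behavior parts (a) and (b) ask you to examine.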
After finishing the kernel, answer the following questions:
a. [60] <4.4> Compile your code using the -ptx option and inspect the PTX representation
of your kernel. How many PTX instructions make up the PTX implementation of your
kernel? Did the conditional sections of your kernel include branch instructions or only
predicated non-branch instructions?
b. [60] <4.4> After executing your code in the simulator, what is the dynamic instruction
count? What is the achieved instructions per cycle (IPC) or instruction issue rate?
What is the dynamic instruction breakdown in terms of control instructions,
arithmetic-logical unit (ALU) instructions, and memory instructions? Are there any
shared memory bank conflicts? What is the effective off-chip memory bandwidth?
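For the effective off-chip bandwidth in part b, a common definition is bytes actually transferred divided by kernel execution time; the helper below is a hypothetical illustration, with the byte counts and timing to be taken from the profiler or simulator output.

/* Hypothetical helper: effective bandwidth in GB/sec from measured totals. */
double effective_bandwidth_gbs(double bytes_read, double bytes_written,
                               double kernel_time_sec) {
    return (bytes_read + bytes_written) / kernel_time_sec / 1e9;
}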
c. [60] <4.4> Implement an improved version of your kernel where off-chip memory references
are coalesced and observe the differences in runtime performance.
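As a reference point for part c, a minimal illustration of coalescing (the kernel name and shapes are ours): when consecutive threads of a warp touch consecutive addresses, the hardware can merge their accesses into a few wide memory transactions, whereas strided patterns waste off-chip bandwidth.

__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];      /* thread i -> element i: coalesced     */
}
/* By contrast, a strided access such as in[i * 32] splits the warp's loads
   across many separate memory transactions.                               */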
¹This chapter is based on material in Appendix F, "Vector Processors," by Krste Asanovic, and Appendix G, "Hardware
and Software for VLIW and EPIC," from the 4th edition of this book; on material in Appendix A, "Graphics and
Computing GPUs," by John Nickolls and David Kirk, from the 4th edition of Computer Organization and Design; and
 