A[i] = C[i] * c; /* S3 */
C[i] = D[i] * A[i]; /* S4 */
}
c. [15] <4.5> Consider the following loop:
for (i=0; i<100; i++) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
Are there dependences between S1 and S2? Is this loop parallel? If not, show how to
make it parallel.
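For orientation only, this loop closely mirrors the worked example in Section 4.5: S2 in iteration i produces the B[i+1] that S1 reads in iteration i+1, a loop-carried but non-circular dependence. As a sketch of what "make it parallel" can mean here (one standard rewriting, assuming the same arrays A, B, C, and D as in the exercise; not presented as the only correct answer):

/* Sketch: realign S1 and S2 across iterations and peel the first and last
   computations, so the remaining loop body has no loop-carried dependence. */
A[0] = A[0] + B[0];
for (i = 0; i < 99; i++) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];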
4.15 [10] <4.4> List and describe at least four factors that influence the performance of GPU
kernels. In other words, which runtime behaviors caused by the kernel code reduce resource
utilization during kernel execution?
4.16 [10] <4.4> Assume a hypothetical GPU with the following characteristics:
■ Clock rate 1.5 GHz
■ Contains 16 SIMD processors, each containing 16 single-precision floating-point units
■ Has 100 GB/sec off-chip memory bandwidth
Without considering memory bandwidth, what is the peak single-precision floating-point
throughput for this GPU in GFLOP/sec, assuming that all memory latencies can be hidden? Is
this throughput sustainable given the memory bandwidth limitation?
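One way to organize the calculation is sketched below; it assumes each floating-point unit retires one single-precision operation per cycle (an FMA-capable unit would double the figure), which is our assumption rather than a parameter stated in the exercise.

/* Back-of-the-envelope sketch for Exercise 4.16. */
#include <stdio.h>
int main(void) {
    double clock_ghz = 1.5;           /* clock rate                        */
    int simd_processors = 16;         /* SIMD processors                   */
    int fp_units = 16;                /* single-precision FP units each    */
    double peak_gflops = clock_ghz * simd_processors * fp_units;

    double bandwidth_gbs = 100.0;     /* off-chip memory bandwidth         */
    /* FLOPs that must be performed per byte fetched to sustain the peak. */
    double flops_per_byte = peak_gflops / bandwidth_gbs;
    printf("peak = %.0f GFLOP/sec; sustaining it needs %.2f FLOPs/byte\n",
           peak_gflops, flops_per_byte);
    return 0;
}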
4.17 [60] <4.4> For this programming exercise, you will write and characterize the behavior
of a CUDA kernel that contains a high amount of data-level parallelism but also contains
conditional execution behavior. Use the NVIDIA CUDA Toolkit along with GPGPU-Sim from
the University of British Columbia (http://www.ece.ubc.ca/~aamodt/gpgpu-sim/) or the CUDA
Profiler to write and compile a CUDA kernel that performs 100 iterations of Conway's
Game of Life for a 256 × 256 game board and returns the final state of the game board to
the host. Assume that the board is initialized by the host. Associate one thread with each
cell. Make sure you add a barrier after each game iteration. Use the following game rules:
■ Any live cell with fewer than two live neighbors dies.
■ Any live cell with two or three live neighbors lives on to the next generation.
■ Any live cell with more than three live neighbors dies.
■ Any dead cell with exactly three live neighbors becomes a live cell.
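A minimal sketch of how such a kernel might be organized is shown below. It is illustrative only: the names life_step, d_a, and d_b, the 16 × 16 thread-block shape, and the dead-border boundary handling are our assumptions, not requirements of the exercise. One kernel launch per generation supplies the grid-wide barrier between iterations.

/* One thread per cell; cells are stored as 0 (dead) or 1 (live). */
__global__ void life_step(const unsigned char *in, unsigned char *out) {
    const int N = 256;                      /* board dimension from the exercise */
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= N || y >= N) return;
    int live = 0;                           /* count live neighbors              */
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            if (dx == 0 && dy == 0) continue;
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < N && ny >= 0 && ny < N)   /* border treated as dead */
                live += in[ny * N + nx];
        }
    unsigned char cell = in[y * N + x];
    /* Live cell survives with 2 or 3 neighbors; dead cell with exactly 3
       neighbors becomes live; everything else is dead next generation.   */
    out[y * N + x] = cell ? (live == 2 || live == 3) : (live == 3);
}

/* Host loop: one launch per generation; swapping the boards between
   launches provides the barrier the exercise asks for.                   */
void run_game(unsigned char *d_a, unsigned char *d_b) {
    dim3 block(16, 16), grid(256 / 16, 256 / 16);
    for (int gen = 0; gen < 100; gen++) {
        life_step<<<grid, block>>>(d_a, d_b);
        cudaDeviceSynchronize();
        unsigned char *t = d_a; d_a = d_b; d_b = t;
    }
}

In a sketch like this, the boundary tests and the rule evaluation are the conditional sections whose PTX form and runtime behavior parts (a) and (b) ask you to examine.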
After finishing the kernel, answer the following questions:
a. [60] <4.4> Compile your code using the -ptx option and inspect the PTX representation
of your kernel. How many PTX instructions make up the PTX implementation of your
kernel? Did the conditional sections of your kernel include branch instructions or only
predicated non-branch instructions?
b. [60] <4.4> After executing your code in the simulator, what is the dynamic instruction
count? What is the achieved instructions per cycle (IPC) or instruction issue rate?
What is the dynamic instruction breakdown in terms of control instructions,
arithmetic-logical unit (ALU) instructions, and memory instructions? Are there any
shared memory bank conflicts? What is the effective off-chip memory bandwidth?
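For the effective off-chip bandwidth in part b, a common definition is bytes actually transferred divided by kernel execution time; the helper below is a hypothetical illustration, with the byte counts and timing to be taken from the profiler or simulator output.

/* Hypothetical helper: effective bandwidth in GB/sec from measured totals. */
double effective_bandwidth_gbs(double bytes_read, double bytes_written,
                               double kernel_time_sec) {
    return (bytes_read + bytes_written) / kernel_time_sec / 1e9;
}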
c. [60] <4.4> Implement an improved version of your kernel where off-chip memory references
are coalesced and observe the differences in runtime performance.
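As a reference point for part c, a minimal illustration of coalescing (the kernel name and shapes are ours): when consecutive threads of a warp touch consecutive addresses, the hardware can merge their accesses into a few wide memory transactions, whereas strided patterns waste off-chip bandwidth.

__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];      /* thread i -> element i: coalesced     */
}
/* By contrast, a strided access such as in[i * 32] splits the warp's loads
   across many separate memory transactions.                               */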
¹This chapter is based on material in Appendix F, "Vector Processors," by Krste Asanovic, and Appendix G, "Hardware
and Software for VLIW and EPIC," from the 4th edition of this book; on material in Appendix A, "Graphics and
Computing GPUs," by John Nickolls and David Kirk, from the 4th edition of Computer Organization and Design; and
 