LI   R2,0        ; R2 = 0
SW   R2,vec      ; vec[0] = 0
SW   R2,vec+4    ; vec[1] = 0
LI   R2,2000     ; R2 = 2000
SW   R2,vec+8    ; vec[2] = 2000
SW   R2,vec+12   ; vec[3] = 2000
LV   V1,vec      ; load vector register V1 from memory starting at vec
Assume the maximum vector length is 64. Is there any way performance can be improved
using gather-scatter loads? If so, by how much?
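For reference when considering this question: a gather load fetches elements through an index vector (such as the one built up in the code above) rather than from consecutive addresses. The sketch below shows that access pattern in CUDA, the language the following exercises use; all names in it are chosen for illustration only.

// Illustrative gather: each element is fetched through an index vector
// rather than with a unit stride. dst, src, and idx are hypothetical names.
__global__ void gather(float *dst, const float *src, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];   // indexed (gathered) load
}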
4.5 [25] <4.4> Now assume we want to implement the MrBayes kernel on a GPU using a
single thread block. Rewrite the C code of the kernel using CUDA. Assume that pointers
to the conditional likelihood and transition probability tables are specified as parameters
to the kernel. Invoke one thread for each iteration of the loop. Load any reused values into
shared memory before performing operations on them.
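As a starting point, here is a minimal sketch of the kind of kernel the exercise asks for, not a definitive solution: the array names (clP, clL, clR, tiPL, tiPR), the four-state (A, C, G, T) likelihood model, and the per-site memory layout are all assumptions made for illustration.

#define NUM_STATES 4   // assumed four-state (A, C, G, T) model

// One thread per loop iteration (per site), launched as a single block.
// clP receives the parent's conditional likelihoods; clL and clR hold the
// two children's; tiPL and tiPR are the 4x4 transition probability tables.
__global__ void condLikeDown(float *clP, const float *clL, const float *clR,
                             const float *tiPL, const float *tiPR, int numSites)
{
    // Both transition tables are reused by every thread, so stage them in
    // shared memory once per block (assumes blockDim.x >= 16 threads).
    __shared__ float sPL[NUM_STATES * NUM_STATES];
    __shared__ float sPR[NUM_STATES * NUM_STATES];
    if (threadIdx.x < NUM_STATES * NUM_STATES) {
        sPL[threadIdx.x] = tiPL[threadIdx.x];
        sPR[threadIdx.x] = tiPR[threadIdx.x];
    }
    __syncthreads();

    int site = threadIdx.x;                     // thread = loop iteration
    if (site >= numSites) return;

    const float *l = clL + site * NUM_STATES;   // left child, this site
    const float *r = clR + site * NUM_STATES;   // right child, this site
    float *p       = clP + site * NUM_STATES;
    for (int i = 0; i < NUM_STATES; i++) {
        float sumL = 0.0f, sumR = 0.0f;
        for (int j = 0; j < NUM_STATES; j++) {
            sumL += sPL[i * NUM_STATES + j] * l[j];
            sumR += sPR[i * NUM_STATES + j] * r[j];
        }
        p[i] = sumL * sumR;                     // combine the two children
    }
}

A launch such as condLikeDown<<<1, numSites>>>(...) then runs one thread per loop iteration within a single thread block, and the __syncthreads() barrier guarantees the staged tables are visible before any thread reads them.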
4.6 [15] <4.4> With CUDA we can use coarse-grain parallelism at the block level to compute
the conditional likelihoods of multiple nodes in parallel. Assume that we want to compute
the conditional likelihoods from the bottom of the tree up. Assume that the conditional
likelihood and transition probability arrays are organized in memory as described in
question 4 and the group of tables for each of the 12 leaf nodes is also stored in consecutive
memory locations in the order of node number. Assume that we want to compute the
conditional likelihood for nodes 12 to 17, as shown in Figure 4.33. Change the method by
which you compute the array indices in your answer from Exercise 4.5 to include the block
number.
FIGURE 4.33 Sample tree.
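Under the same assumptions as the sketch above, the change this exercise asks for amounts to folding the block number into the index arithmetic so that each thread block handles one of nodes 12 to 17. The lChild and rChild parameters below are hypothetical arrays carrying the child-node numbers read off the tree in Figure 4.33; the per-node table layout is likewise an assumption.

#define NUM_STATES 4

// One block per node; block 0 handles firstNode (e.g., node 12). Assumed
// layout: each node owns one numSites x NUM_STATES likelihood table and a
// back-to-back pair of 4x4 transition tables, stored in node-number order.
__global__ void condLikeDownMulti(float *cl, const float *ti,
                                  const int *lChild, const int *rChild,
                                  int numSites, int firstNode)
{
    int node = firstNode + blockIdx.x;          // block number selects the node

    float *clP        = cl + node * numSites * NUM_STATES;
    const float *clL  = cl + lChild[blockIdx.x] * numSites * NUM_STATES;
    const float *clR  = cl + rChild[blockIdx.x] * numSites * NUM_STATES;
    const float *tiPL = ti + node * 2 * NUM_STATES * NUM_STATES;
    const float *tiPR = tiPL + NUM_STATES * NUM_STATES;

    // Body identical to the single-node sketch: stage this node's tables,
    // then compute one site per thread.
    __shared__ float sPL[NUM_STATES * NUM_STATES];
    __shared__ float sPR[NUM_STATES * NUM_STATES];
    if (threadIdx.x < NUM_STATES * NUM_STATES) {
        sPL[threadIdx.x] = tiPL[threadIdx.x];
        sPR[threadIdx.x] = tiPR[threadIdx.x];
    }
    __syncthreads();

    int site = threadIdx.x;
    if (site >= numSites) return;

    const float *l = clL + site * NUM_STATES;
    const float *r = clR + site * NUM_STATES;
    float *p       = clP + site * NUM_STATES;
    for (int i = 0; i < NUM_STATES; i++) {
        float sumL = 0.0f, sumR = 0.0f;
        for (int j = 0; j < NUM_STATES; j++) {
            sumL += sPL[i * NUM_STATES + j] * l[j];
            sumR += sPR[i * NUM_STATES + j] * r[j];
        }
        p[i] = sumL * sumR;
    }
}

A launch such as condLikeDownMulti<<<6, numSites>>>(..., 12) would cover nodes 12 to 17, one per block; launching one kernel per tree level, so children finish before their parents, respects the bottom-up order the exercise specifies.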
4.7 [15] <4.4> Convert your code from Exercise 4.6 into PTX code. How many instructions are
needed for the kernel?
4.8 [10] <4.4> How well do you expect this code to perform on a GPU? Explain your answer.