LI   R2,0        ; R2 = 0
SW   R2,vec      ; vec[0] = 0
SW   R2,vec+4    ; vec[1] = 0
LI   R2,2000     ; R2 = 2000
SW   R2,vec+8    ; vec[2] = 2000
SW   R2,vec+12   ; vec[3] = 2000
LV   V1,vec      ; load vector register V1 from memory starting at vec
Assume the maximum vector length is 64. Is there any way performance can be improved
using gather-scatter loads? If so, by how much?
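For reference when considering this question: a gather load fetches elements through an index vector (such as the one built up in the code above) rather than from consecutive addresses. The sketch below shows that access pattern in CUDA, the language the following exercises use; all names in it are chosen for illustration only.

// Illustrative gather: each element is fetched through an index vector
// rather than with a unit stride. dst, src, and idx are hypothetical names.
__global__ void gather(float *dst, const float *src, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];   // indexed (gathered) load
}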
4.5 [25] <4.4> Now assume we want to implement the MrBayes kernel on a GPU using a
single thread block. Rewrite the C code of the kernel using CUDA. Assume that pointers
to the conditional likelihood and transition probability tables are specified as parameters
to the kernel. Invoke one thread for each iteration of the loop. Load any reused values into
shared memory before performing operations on them.
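As a starting point, here is a minimal sketch of the kind of kernel the exercise asks for, not a definitive solution: the array names (clP, clL, clR, tiPL, tiPR), the four-state (A, C, G, T) likelihood model, and the per-site memory layout are all assumptions made for illustration.

#define NUM_STATES 4   // assumed four-state (A, C, G, T) model

// One thread per loop iteration (per site), launched as a single block.
// clP receives the parent's conditional likelihoods; clL and clR hold the
// two children's; tiPL and tiPR are the 4x4 transition probability tables.
__global__ void condLikeDown(float *clP, const float *clL, const float *clR,
                             const float *tiPL, const float *tiPR, int numSites)
{
    // Both transition tables are reused by every thread, so stage them in
    // shared memory once per block (assumes blockDim.x >= 16 threads).
    __shared__ float sPL[NUM_STATES * NUM_STATES];
    __shared__ float sPR[NUM_STATES * NUM_STATES];
    if (threadIdx.x < NUM_STATES * NUM_STATES) {
        sPL[threadIdx.x] = tiPL[threadIdx.x];
        sPR[threadIdx.x] = tiPR[threadIdx.x];
    }
    __syncthreads();

    int site = threadIdx.x;                     // thread = loop iteration
    if (site >= numSites) return;

    const float *l = clL + site * NUM_STATES;   // left child, this site
    const float *r = clR + site * NUM_STATES;   // right child, this site
    float *p       = clP + site * NUM_STATES;
    for (int i = 0; i < NUM_STATES; i++) {
        float sumL = 0.0f, sumR = 0.0f;
        for (int j = 0; j < NUM_STATES; j++) {
            sumL += sPL[i * NUM_STATES + j] * l[j];
            sumR += sPR[i * NUM_STATES + j] * r[j];
        }
        p[i] = sumL * sumR;                     // combine the two children
    }
}

A launch such as condLikeDown<<<1, numSites>>>(...) then runs one thread per loop iteration within a single thread block, and the __syncthreads() barrier guarantees the staged tables are visible before any thread reads them.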
4.6 [15] <4.4> With CUDA we can use coarse-grain parallelism at the block level to compute
the conditional likelihoods of multiple nodes in parallel. Assume that we want to compute
the conditional likelihoods from the bottom of the tree up. Assume that the conditional
likelihood and transition probability arrays are organized in memory as described in
question 4 and the group of tables for each of the 12 leaf nodes is also stored in consecutive
memory locations in the order of node number. Assume that we want to compute the
conditional likelihood for nodes 12 to 17, as shown in Figure 4.33. Change the method by
which you compute the array indices in your answer from Exercise 4.5 to include the block
number.
FIGURE 4.33 Sample tree.
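Under the same assumptions as the sketch above, the change this exercise asks for amounts to folding the block number into the index arithmetic so that each thread block handles one of nodes 12 to 17. The lChild and rChild parameters below are hypothetical arrays carrying the child-node numbers read off the tree in Figure 4.33; the per-node table layout is likewise an assumption.

#define NUM_STATES 4

// One block per node; block 0 handles firstNode (e.g., node 12). Assumed
// layout: each node owns one numSites x NUM_STATES likelihood table and a
// back-to-back pair of 4x4 transition tables, stored in node-number order.
__global__ void condLikeDownMulti(float *cl, const float *ti,
                                  const int *lChild, const int *rChild,
                                  int numSites, int firstNode)
{
    int node = firstNode + blockIdx.x;          // block number selects the node

    float *clP        = cl + node * numSites * NUM_STATES;
    const float *clL  = cl + lChild[blockIdx.x] * numSites * NUM_STATES;
    const float *clR  = cl + rChild[blockIdx.x] * numSites * NUM_STATES;
    const float *tiPL = ti + node * 2 * NUM_STATES * NUM_STATES;
    const float *tiPR = tiPL + NUM_STATES * NUM_STATES;

    // Body identical to the single-node sketch: stage this node's tables,
    // then compute one site per thread.
    __shared__ float sPL[NUM_STATES * NUM_STATES];
    __shared__ float sPR[NUM_STATES * NUM_STATES];
    if (threadIdx.x < NUM_STATES * NUM_STATES) {
        sPL[threadIdx.x] = tiPL[threadIdx.x];
        sPR[threadIdx.x] = tiPR[threadIdx.x];
    }
    __syncthreads();

    int site = threadIdx.x;
    if (site >= numSites) return;

    const float *l = clL + site * NUM_STATES;
    const float *r = clR + site * NUM_STATES;
    float *p       = clP + site * NUM_STATES;
    for (int i = 0; i < NUM_STATES; i++) {
        float sumL = 0.0f, sumR = 0.0f;
        for (int j = 0; j < NUM_STATES; j++) {
            sumL += sPL[i * NUM_STATES + j] * l[j];
            sumR += sPR[i * NUM_STATES + j] * r[j];
        }
        p[i] = sumL * sumR;
    }
}

A launch such as condLikeDownMulti<<<6, numSites>>>(..., 12) would cover nodes 12 to 17, one per block; launching one kernel per tree level, so children finish before their parents, respects the bottom-up order the exercise specifies.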
4.7 [15] <4.4> Convert your code from Exercise 4.6 into PTX code. How many instructions are
needed for the kernel?
4.8 [10] <4.4> How well do you expect this code to perform on a GPU? Explain your answer.