Fixed local work size. Using the notation introduced previously, work-item (j, i)
performs the following reads in loop iteration k:
$$
A[i,k],\quad B[4k+0,j],\quad B[4k+1,j],\quad B[4k+2,j],\quad B[4k+3,j],
$$
where each memory access now loads a float4 vector. With many active threads,
we will first see all threads performing their first read from A, and thereafter
we will see all threads performing their first read from B, etc. This implies that
reads executed one after another correspond to different threads executing
the same instruction in the program code. With a local work size of (4, 32), the
GPU initiates the work-items for work-group (m, n) by incrementing the first
index first, i.e., in the order
$$
\begin{aligned}
&(4m,\,32n),\ (4m+1,\,32n),\ (4m+2,\,32n),\ (4m+3,\,32n), \\
&(4m,\,32n+1),\ (4m+1,\,32n+1),\ (4m+2,\,32n+1),\ (4m+3,\,32n+1), \\
&(4m,\,32n+2),\ (4m+1,\,32n+2),\ (4m+2,\,32n+2),\ (4m+3,\,32n+2), \\
&\ldots, \\
&(4m,\,32n+31),\ (4m+1,\,32n+31),\ (4m+2,\,32n+31),\ (4m+3,\,32n+31),
\end{aligned}
$$
where we have again used the comma as a sequencing operation to describe the
ordering of global_id values of the work-items.
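
The ordering described above can be reproduced with a short sketch. It assumes only what the text states, namely a local work size of (4, 32) and the first index being incremented first; the function name `work_item_order` and the example values of m and n are hypothetical and chosen for illustration.

```python
def work_item_order(m, n, local_size=(4, 32)):
    """List the global IDs (j, i) of work-group (m, n) in launch order.

    Assumption taken from the text: the first index (j) varies fastest.
    """
    size_j, size_i = local_size
    return [
        (size_j * m + dj, size_i * n + di)
        for di in range(size_i)   # second index: outer loop
        for dj in range(size_j)   # first index: inner loop, incremented first
    ]


order = work_item_order(m=1, n=2)
for global_id in order[:8]:
    print(global_id)
```

The nesting of the two loops mirrors the listing: the inner loop runs over the four values of the first index before the second index advances.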
This means that the memory reads for loop iteration k will execute in the
following order:
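$$
\begin{aligned}
&\mathop{\text{\Large ,}}_{i=32n}^{32n+31}\ \mathop{\text{\Large ,}}_{j=4m}^{4m+3} A[i,k], \\
&\mathop{\text{\Large ,}}_{i=32n}^{32n+31}\ \mathop{\text{\Large ,}}_{j=4m}^{4m+3} B[4k+0,j], \qquad
 \mathop{\text{\Large ,}}_{i=32n}^{32n+31}\ \mathop{\text{\Large ,}}_{j=4m}^{4m+3} B[4k+1,j], \\
&\mathop{\text{\Large ,}}_{i=32n}^{32n+31}\ \mathop{\text{\Large ,}}_{j=4m}^{4m+3} B[4k+2,j], \qquad
 \mathop{\text{\Large ,}}_{i=32n}^{32n+31}\ \mathop{\text{\Large ,}}_{j=4m}^{4m+3} B[4k+3,j],
\end{aligned}
$$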
where the ID variable j is incremented before i as it corresponds to get_global_id(0),
and it is therefore written as the innermost sequencing (comma) operator.
We see that the reads from A do not depend on j and are therefore repeated
for each group of four consecutive work-items, and we introduce the × operation
to reflect repetition of the same memory access as in
$$
\mathop{\text{\Large ,}}_{j=4m}^{4m+3} A[i,k] = A[i,k] \times 4
$$
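
As a rough illustration of the read order and of the × notation, the sketch below builds the stream of reads for one loop iteration and run-length encodes the reads from A. It assumes the launch order described in the text (first index fastest, local work size (4, 32)) and the idealized model in which all work-items execute one read instruction before any of them moves to the next. The helper names `reads_for_iteration` and `run_lengths` are hypothetical, and each tuple stands for one float4 access.

```python
from itertools import groupby


def reads_for_iteration(m, n, k, local_size=(4, 32)):
    """Stream of reads for loop iteration k in work-group (m, n).

    Idealized model from the text: every work-item issues its read from A,
    then every work-item issues its first read from B, and so on.
    """
    size_j, size_i = local_size
    ids = [(size_j * m + dj, size_i * n + di)
           for di in range(size_i)    # i: outer sequencing operator
           for dj in range(size_j)]   # j: innermost sequencing operator
    stream = [("A", i, k) for (j, i) in ids]
    for r in range(4):
        stream += [("B", 4 * k + r, j) for (j, i) in ids]
    return stream


def run_lengths(stream):
    """Collapse consecutive identical reads into (read, repeat_count) pairs."""
    return [(read, len(list(run))) for read, run in groupby(stream)]


stream = reads_for_iteration(m=0, n=0, k=0)
a_runs = run_lengths(stream[:128])   # the first 128 reads are the reads from A
print(a_runs[:3])
```

Each entry of `a_runs` pairs one address A[i, k] with its repeat count, which corresponds to writing the access as A[i, k] × 4.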