Fixed local work size. Using the notation introduced previously, work-item $(j, i)$ performs the following reads in loop iteration $k$:

$$A[i,k],\quad B[4k+0,j],\quad B[4k+1,j],\quad B[4k+2,j],\quad B[4k+3,j],$$

where each memory access now loads a float4 vector. With many active threads, we will first see all threads performing their first read from $A$, and thereafter we will see all threads performing their first read from $B$, etc. This implies that reads that are executed one after another correspond to different threads executing the same instruction in the program code. With a local work size of (4, 32), the GPU initiates the work-items for work-group $(m, n)$ by incrementing the first index first, i.e., in the order

$$
\begin{aligned}
&(4m,\ 32n),\ (4m+1,\ 32n),\ (4m+2,\ 32n),\ (4m+3,\ 32n),\\
&(4m,\ 32n+1),\ (4m+1,\ 32n+1),\ (4m+2,\ 32n+1),\ (4m+3,\ 32n+1),\\
&(4m,\ 32n+2),\ (4m+1,\ 32n+2),\ (4m+2,\ 32n+2),\ (4m+3,\ 32n+2),\\
&\ldots,\\
&(4m,\ 32n+31),\ (4m+1,\ 32n+31),\ (4m+2,\ 32n+31),\ (4m+3,\ 32n+31),
\end{aligned}
$$

where we have again used the comma as a sequencing operation to describe the ordering of global_id values of the work-items.
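
The ordering above can be reproduced with a short host-side sketch. This is only a model of the dispatch order as described in the text, not a statement about how any particular GPU schedules work-items; the function name `work_item_order` and the example values of `m` and `n` are assumptions made for illustration.

```python
def work_item_order(m, n, local_size=(4, 32)):
    """List the global IDs (j, i) of work-group (m, n).

    Assumption: the first index (j, i.e. get_global_id(0)) is
    incremented first, as stated in the text.
    """
    size_j, size_i = local_size
    return [
        (size_j * m + dj, size_i * n + di)
        for di in range(size_i)   # second index: outer loop, varies slowest
        for dj in range(size_j)   # first index: inner loop, varies fastest
    ]


order = work_item_order(m=1, n=2)
print(order[:5])   # [(4, 64), (5, 64), (6, 64), (7, 64), (4, 65)]
print(order[-1])   # (7, 95)
print(len(order))  # 128
```

With a local work size of (4, 32), each work-group contains 4 × 32 = 128 work-items, and the second index advances only after every fourth work-item.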
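This means that the memory reads for loop iteration $k$ will execute in the following order:

$$
\begin{aligned}
&\mathop{\text{\Large ,}}\limits_{i=32n}^{32n+31}\ \mathop{\text{\Large ,}}\limits_{j=4m}^{4m+3} A[i,k],\\
&\mathop{\text{\Large ,}}\limits_{i=32n}^{32n+31}\ \mathop{\text{\Large ,}}\limits_{j=4m}^{4m+3} B[4k+0,j],\qquad
\mathop{\text{\Large ,}}\limits_{i=32n}^{32n+31}\ \mathop{\text{\Large ,}}\limits_{j=4m}^{4m+3} B[4k+1,j],\\
&\mathop{\text{\Large ,}}\limits_{i=32n}^{32n+31}\ \mathop{\text{\Large ,}}\limits_{j=4m}^{4m+3} B[4k+2,j],\qquad
\mathop{\text{\Large ,}}\limits_{i=32n}^{32n+31}\ \mathop{\text{\Large ,}}\limits_{j=4m}^{4m+3} B[4k+3,j],
\end{aligned}
$$
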
where the ID variable $j$ is incremented before $i$ as it corresponds to get_global_id(0), and it is therefore written as the innermost $,$ operator.
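
As a rough illustration of the nested sequencing operators, the following sketch lists the reads of one work-group for a single loop iteration $k$. It assumes an idealized lock-step model in which every work-item issues the same read before any work-item issues the next one; the function name `read_sequence` and the tuple encoding `(matrix, row, column)` are hypothetical choices, not part of the original kernel.

```python
def read_sequence(m, n, k, local_size=(4, 32)):
    """Model the read order of work-group (m, n) in loop iteration k.

    Each read is encoded as (matrix, row, column); every entry stands
    for one float4 load. Assumes an idealized lock-step execution.
    """
    size_j, size_i = local_size
    # Work-items in dispatch order: j (get_global_id(0)) is innermost.
    items = [
        (size_j * m + dj, size_i * n + di)
        for di in range(size_i)
        for dj in range(size_j)
    ]
    # First, every work-item reads A[i, k].
    reads = [("A", i, k) for (_j, i) in items]
    # Then every work-item reads B[4k+r, j], for r = 0, 1, 2, 3 in turn.
    for r in range(4):
        reads += [("B", 4 * k + r, j) for (j, _i) in items]
    return reads


seq = read_sequence(m=0, n=0, k=0)
print(seq[:5])       # [('A', 0, 0), ('A', 0, 0), ('A', 0, 0), ('A', 0, 0), ('A', 1, 0)]
print(seq[128:133])  # [('B', 0, 0), ('B', 0, 1), ('B', 0, 2), ('B', 0, 3), ('B', 0, 0)]
```

In this model the reads from $A$ change address only every fourth entry, whereas the reads from $B$ cycle through four consecutive values of $j$.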
We see that the reads from $A$ do not depend on $j$ and are therefore repeated for each group of four consecutive work-items, and we introduce the $\times$ operation to reflect repetition of the same memory access as in

$$\mathop{\text{\Large ,}}\limits_{j=4m}^{4m+3} A[i,k] = A[i,k] \times 4$$
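
The $\times$ notation amounts to a run-length encoding of the read sequence. A minimal sketch, assuming reads are encoded as `(matrix, row, column)` tuples (a hypothetical encoding chosen here for illustration) and using work-group row $n = 0$ and iteration $k = 0$ as example values:

```python
from itertools import groupby


def compress(reads):
    """Collapse consecutive identical reads into (read, repeat_count) pairs."""
    return [(read, len(list(run))) for read, run in groupby(reads)]


def show(read, count):
    """Format one compressed entry in the text's notation, e.g. 'A[0,0] × 4'."""
    matrix, row, col = read
    return f"{matrix}[{row},{col}] × {count}"


# Reads from A by one work-group with n = 0 in iteration k = 0:
# i runs over 0..31 (outer), j over four work-items (inner, address unchanged).
a_reads = [("A", i, 0) for i in range(32) for _j in range(4)]

compressed = compress(a_reads)
print(show(*compressed[0]))  # A[0,0] × 4
print(len(compressed))       # 32
```

The 128 reads from $A$ collapse to 32 distinct accesses, each repeated four times, which is exactly what $A[i,k] \times 4$ expresses.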