for them, and we find 1.25 and 1, respectively, showing that the second blocking
needs fewer load operations. We see that the added flexibility to choose blocking widths in the NT version can help us decrease the number of memory operations, which is one reason to expect the NT variants to perform better on the GPU.
In both cases, however, we perform O(N³) load operations from data of size O(N²), which means that each datum is reloaded into registers O(N) times. Effective cache usage will clearly be important, as it allows us to access data from caches rather than from main memory.
7.5.6 L1 Cache Analysis of the 1 × 4 × 4 Blocked NN SGEMM
Overview. To estimate the cache usage of our kernels, we have to take into account
the order of reads within a work-item, as well as the fact that many work-items
are active at the same time. We will first look at one specific choice of program
and local work size and then try to extend the analysis to be able to compare the
cache usage of different implementations and local work sizes.
In the program we choose to analyze first, the 1 × 4 × 4 blocked NN implementation, every work-item performs five memory operations per iteration in its loop,
and we will assume that the memory operations take place in the same order as
they appear in the program (i.e., the compiler does not change their order), and
that, for a given work-item, memory operations never execute in parallel. Another important restriction we make in our analysis is that we are able to perfectly predict the order in which work-items execute on
the GPU. Finally, this section will only focus on the L1 cache, which allows us
to restrict the analysis to a single core, as each core has its own L1 cache.
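To make the setting concrete, the following is a minimal sketch of what the inner loop of such a 1 × 4 × 4 blocked NN kernel could look like. The argument names, the indexing, and the use of float4 vector loads are assumptions made purely for illustration, not the actual kernel analyzed here, but the sketch exhibits the five memory operations issued in every iteration of the loop.

__kernel void sgemm_nn_1x4x4(__global const float *A,  /* M x K, row-major (assumed)  */
                             __global const float *B,  /* K x N, row-major (assumed)  */
                             __global float       *C,  /* M x N, row-major (assumed)  */
                             const int K,
                             const int N)
{
    const int i = get_global_id(1);      /* the single row of C this work-item computes */
    const int j = get_global_id(0) * 4;  /* the first of its four columns of C          */

    float4 acc = (float4)(0.0f);         /* 1 x 4 block of C accumulated in registers   */

    for (int k = 0; k < K; k += 4) {
        /* Five memory operations per iteration:                                        */
        float4 a  = vload4(0, A + i * K + k);         /* one load from A                */
        float4 b0 = vload4(0, B + (k + 0) * N + j);   /* four loads from B              */
        float4 b1 = vload4(0, B + (k + 1) * N + j);
        float4 b2 = vload4(0, B + (k + 2) * N + j);
        float4 b3 = vload4(0, B + (k + 3) * N + j);

        acc += a.x * b0 + a.y * b1 + a.z * b2 + a.w * b3;
    }

    vstore4(acc, 0, C + i * N + j);      /* single store of the finished 1 x 4 block    */
}

In this sketch the store of C falls outside the loop, so the five memory operations per iteration are all loads, matching the count above.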
Thread order. A single work-item loops over the variable k, and for every value of k, it performs one memory load from A and four memory loads from B. With our assumptions, we know that the work-item¹⁷ will enter the load-store pipeline once
for each of those instructions, in the order they appear in the program source.
We also know that we schedule one work-group of work-items at a time, that
those work-items execute their memory operations in an interleaved fashion one
after the other, and that they always do this in the order they were spawned by
the GPU. We will now see how we can use that knowledge to analyze how much of the L1 data cache we need.
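As an illustration of the kind of bookkeeping this implies, the small host-side C sketch below enumerates, for a single value of k, the order in which the loads of one work-group could reach the load-store pipeline under one reading of these assumptions. The work-group size, its layout along a single row of C, the matrix dimension, and the cache-line size are all assumptions chosen for illustration only.

#include <stdio.h>

#define WG_SIZE    16   /* assumed work-group size, laid out along one row of C */
#define LINE_BYTES 64   /* assumed L1 cache-line size in bytes                  */
#define N          256  /* assumed matrix dimension                             */

/* Byte address of the op-th load (op 0 is the load from A, ops 1..4 are the
 * loads from rows k..k+3 of B) issued by work-item wi, which works on row i
 * of C in the iteration starting at k. Indexing mirrors the sketch above.   */
static long load_address(int op, int wi, int i, int k)
{
    int j = wi * 4;                               /* this work-item's four columns */
    int elem = (op == 0) ? (i * N + k)            /* the load from A               */
                         : ((k + op - 1) * N + j);/* one of the loads from B       */
    return (long)elem * sizeof(float);
}

int main(void)
{
    int i = 0, k = 0;   /* look at the first row and the first loop iteration      */

    /* Under this reading, all work-items sit at the same point in the program
     * and take turns issuing their next memory operation in spawn order, so the
     * pipeline sees every work-item's load from A, then every work-item's first
     * load from B, and so on. Counting the distinct cache lines in this stream
     * estimates how much of the L1 cache one iteration of the work-group touches. */
    for (int op = 0; op < 5; op++)
        for (int wi = 0; wi < WG_SIZE; wi++)
            printf("load %d of work-item %2d -> cache line %ld\n",
                   op, wi, load_address(op, wi, i, k) / LINE_BYTES);
    return 0;
}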
¹⁷ From a hardware point of view, we of course discuss the behavior and order of threads, but
we continue to use the term work-item, remembering that one work-item in OpenCL corresponds
to one thread in the GPU.