So how to reduce cache misses? There are a couple of generalities that we can point to, but not
much more. Happily, these generalities do cover a lot of programs.
1. Write your program so that you never have to load a cache line more times than is absolutely necessary.
2. Organize your data so that when you do load a cache line, you are able to make use of all the data in it.
3. Keep data used regularly by different threads out of the same cache line.
Depending upon your particular program, it may or may not be reasonable to apply the above. For
well-behaved programs that reuse the data in cache many times, a great deal can be done just by following these three rules. We can show a factor-of-10 difference between a naive matrix multiply
and the most highly optimized implementation, all due to better cache management. For programs
with very poor data locality, such as NFS or databases, which spend a lot of time bringing in new
data and looking at it only once, it is almost impossible to do anything at all.
Two SPECfp95 benchmarks were submitted by Sun for almost the identical machine. The first
was a 400-MHz, 8-way UE 3500 with 4-MB E$. The second was the same, but with an 8-MB E$.
We show (in Table 16-1) the three most affected benchmarks, along with the geometric average for the full suite.
Table 16-1. Selected SPEC Benchmarks for Two UE 3500s
For something like matrix manipulation or image processing, a naive algorithm might load and
reload a cache line numerous times. The same operation can be performed much faster in a more
clever algorithm that does cache blocking--arranging to load a subset of the data and use it many
times before loading a new block.
A naive multiply algorithm would multiply all of row 1 by column 1, then row 1 by column 2, column 3, etc. Next, row 2 would be multiplied with each column, and so on. For a 1,024 x 1,024 matrix, each row would be loaded only once, but each column would be reloaded 1,024 times! Assuming 64-bit floats and 64-byte cache lines, a single row or column spans 128 cache lines, so the column traffic alone adds up to roughly 128M cache-line loads (1,024 columns x 128 lines x 1,024 reloads).
A cache-blocked program would multiply rows 1-64 with columns 1-64, then columns 65-128, then 129-192, etc. Each of those working sets fits completely in a 2-MB E$, so the total is reduced to a mere 16k column loads (each column now reloaded only 16 times, once per block of rows) plus 1k row loads.
That's the basics of cache blocking. There's plenty more that can be done. For example, you can
optimize I$ blocking on top of the E$ blocking. You can take into account the writing scheme
(does the CPU write back via the cache, write through the cache, or write around it?). You can
recall that E$ is physically mapped, hence it requires a TLB translation. (The translation lookaside
buffer performs high-speed virtual-to-physical mappings.) Of course, TLBs are very small. The
Sun TLB for the large SC2000 server holds a translation for only 0.5 MB, so if you can avoid
referencing data in cache beyond the current contents of the TLB, you can avoid extraneous TLB
misses. Then you may also wish to consider which data is coming from which memory bank.