So how to reduce cache misses? There are a couple of generalities that we can point to, but not
much more. Happily, these generalities do cover a lot of programs.
1. Write your program so that you never have to load a cache line more times than is absolutely necessary.
2. Organize your data so that when you do load a cache line, you are able to make use of all the data in it.
3. Keep data used regularly by different threads out of the same cache line.
Depending upon your particular program, it may or may not be reasonable to apply the above. For
well-behaved programs that reuse the data in cache many times, a great deal can be done just by following these three rules. We can show a factor-of-10 difference between a naive matrix multiply
and the most highly optimized implementation, all due to better cache management. For programs
with very poor data locality, such as NFS or databases, which spend a lot of time bringing in new
data and looking at it only once, it is almost impossible to do anything at all.
Two SPECfp95 benchmarks were submitted by Sun for almost the identical machine. The first
was a 400-MHz, 8-way UE 3500 with 4-MB E$. The second was the same, but with an 8-MB E$.
We show (in Table 16-1) the three most affected benchmarks, along with the geometric average for the full suite.
Table 16-1. Selected SPEC Benchmarks for Two UE 3500s
For something like matrix manipulation or image processing, a naive algorithm might load and
reload a cache line numerous times. The same operation can be performed much faster in a more
clever algorithm that does cache blocking--arranging to load a subset of the data and use it many
times before loading a new block.
A naive multiply algorithm would multiply all of row 1 by column 1, then row 1 by column 2, column 3, etc. Next, row 2 would be multiplied with each column, and so on. For a 1,024 x 1,024 matrix, each row would be loaded only once, but each column would be reloaded 1,024 times! Assuming 64-bit floats and 64-byte cache lines, a single row or column spans 128 cache lines, so the column traffic alone adds up to roughly 128M cache-line loads (1,024 columns x 128 lines x 1,024 reloads).
A cache-blocked program would multiply rows 1-64 with columns 1-64, then columns 65-128, then 129-192, etc. Each of those working sets fits completely in a 2-MB E$, so the total is reduced to a mere 16k column loads (each column now reloaded only 16 times, once per block of rows) plus 1k row loads.
That's the basics of cache blocking. There's plenty more that can be done. For example, you can
optimize I$ blocking on top of the E$ blocking. You can take into account the writing scheme
(does the CPU write back via the cache, write through the cache, or write around it?). You can
recall that E$ is physically mapped, hence it requires a TLB translation. (The translation lookaside
buffer performs high-speed virtual-to-physical mappings.) Of course, TLBs are very small. The
Sun TLB for the large SC2000 server holds a translation for only 0.5 MB, so if you can avoid
referencing data in cache beyond the current contents of the TLB, you can avoid extraneous TLB
misses. Then you may also wish to consider which data is coming from which memory bank.