c. [15] <2.2> Write code to perform a transpose with a block size parameter B that uses
B × B blocks (an illustrative sketch follows part (e) below).
d. [12] <2.2> What is the minimum associativity required of the L1 cache for consistent
performance independent of both arrays' position in memory?
e. [20] <2.2> Try out blocked and nonblocked 256 × 256 matrix transpositions on a computer.
How closely do the results match your expectations based on what you know about the
computer's memory system? Explain any discrepancies if possible.
2.2 [10] <2.2> Assume you are designing a hardware prefetcher for the unblocked matrix
transposition code above. The simplest type of hardware prefetcher only prefetches sequential
cache blocks after a miss. More complicated “non-unit stride” hardware prefetchers can
analyze a miss reference stream and detect and prefetch non-unit strides. In contrast, software
prefetching can determine non-unit strides as easily as it can determine unit strides. Assume
prefetches write directly into the cache and that there is no “pollution” (overwriting data that
must be used before the data that are prefetched). For best performance given a non-unit
stride prefetcher, in the steady state of the inner loop how many prefetches must be
outstanding at a given time?
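To illustrate the contrast with hardware prefetching, a software prefetch of the non-unit-stride stream in the unblocked inner loop might be written as sketched below. This is only an assumption-laden example: the loop structure, the GCC/Clang __builtin_prefetch builtin, and the prefetch distance PF_DIST are our choices, and the distance would need tuning on a real machine.

/* Sketch: unblocked transpose with a software prefetch of the stride-n
   (write) stream; PF_DIST is an arbitrary, machine-dependent distance. */
#define PF_DIST 8
void transpose_prefetch(double *dst, const double *src, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (j + PF_DIST < n)
                __builtin_prefetch(&dst[(j + PF_DIST) * n + i], 1, 0); /* rw = 1: write */
            dst[j * n + i] = src[i * n + j];
        }
}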
2.3 [15/20] <2.2> With software prefetching it is important to be careful to have the prefetches
occur in time for use, but also to minimize the number of outstanding prefetches in order to
live within the capabilities of the microarchitecture and to minimize cache pollution. This is
complicated by the fact that different processors have different capabilities and limitations.
a. [15] <2.2> Create a blocked version of the matrix transpose with software prefetching (an
illustrative sketch follows part (b) below).
b. [20] <2.2> Estimate and compare the performance of the blocked and unblocked transpose
codes both with and without software prefetching.
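Purely as a starting point for part (a), one possible combination of blocking and software prefetching is sketched below. The prefetch placement (one block ahead along the source row) and the use of the GCC/Clang __builtin_prefetch builtin are our assumptions, not the book's method.

/* Sketch for part (a): b x b blocked transpose that prefetches the source
   row of the next block column before it is needed. Illustrative only. */
void transpose_blocked_pf(double *dst, const double *src, int n, int b)
{
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int i = ii; i < ii + b && i < n; i++) {
                if (jj + b < n)
                    __builtin_prefetch(&src[i * n + jj + b], 0, 0); /* next block's row */
                for (int j = jj; j < jj + b && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
            }
}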
Case Study 2: Putting It All Together: Highly Parallel Memory Systems
Concept illustrated by this case study
■ Crosscutting Issues: The Design of Memory Hierarchies
The program in Figure 2.29 can be used to evaluate the behavior of a memory system. The
key is having accurate timing and then having the program stride through memory to invoke
different levels of the hierarchy. Figure 2.29 shows the code in C. The first part is a procedure
that uses a standard utility to get an accurate measure of the user CPU time; this procedure
may have to be changed to work on some systems. The second part is a nested loop to read
and write memory at different strides and cache sizes. To get accurate cache timing, this code
is repeated many times. The third part times the nested loop overhead only so that it can be
subtracted from overall measured times to see how long the accesses were. The results are
output in .csv file format to facilitate importing into spreadsheets. You may need to change
CACHE_MAX depending on the question you are answering and the size of memory on the system
you are measuring. Running the program in single-user mode or at least without other active
applications will give more consistent results. The code in Figure 2.29 was derived from a
program written by Andrea Dusseau at the University of California-Berkeley and was based on
a detailed description found in Saavedra-Barrera [1992]. It has been modified to fix a number
of issues with more modern machines and to run under Microsoft Visual C++. It can be
downloaded from www.hpl.hp.com/research/cacti/aca_ch2_cs2.c.
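The actual program should be taken from Figure 2.29 or the URL above. Purely to illustrate the three-part structure described in this paragraph, a heavily simplified sketch of a similar strided-access benchmark is shown below; it uses the portable clock() timer rather than a user-CPU-time utility, omits the loop-overhead measurement, and all names and constants are our own.

/* Simplified sketch of the kind of benchmark described above (not the
   program in Figure 2.29): time strided read-modify-write accesses over
   arrays of increasing size and print CSV rows of ns per access. */
#include <stdio.h>
#include <time.h>

#define CACHE_MIN (1024)              /* smallest array size, in ints */
#define CACHE_MAX (16 * 1024 * 1024)  /* largest array; adjust to your memory */

static int x[CACHE_MAX];

int main(void)
{
    for (int size = CACHE_MIN; size <= CACHE_MAX; size *= 2) {
        for (int stride = 1; stride <= size / 2; stride *= 2) {
            clock_t start = clock();
            int passes = 0;
            do {                            /* repeat to get a measurable time */
                for (int i = 0; i < size; i += stride)
                    x[i] += 1;              /* read and write one element */
                passes++;
            } while (clock() - start < CLOCKS_PER_SEC / 10);
            double sec = (double)(clock() - start) / CLOCKS_PER_SEC;
            double accesses = (double)passes * (size / stride);
            printf("%d,%d,%.1f\n", size, stride, 1e9 * sec / accesses);
        }
    }
    return 0;
}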