c. [15] <2.2> Write code to perform a transpose with a block size parameter B that uses
B × B blocks (an illustrative sketch follows part (e) below).
d. [12] <2.2> What is the minimum associativity required of the L1 cache for consistent
performance independent of both arrays' position in memory?
e. [20] <2.2> Try out blocked and nonblocked 256 × 256 matrix transpositions on a computer.
How closely do the results match your expectations based on what you know about the
computer's memory system? Explain any discrepancies if possible.
2.2 [10] <2.2> Assume you are designing a hardware prefetcher for the unblocked matrix
transposition code above. The simplest type of hardware prefetcher only prefetches sequential
cache blocks after a miss. More complicated “non-unit stride” hardware prefetchers can
analyze a miss reference stream and detect and prefetch non-unit strides. In contrast, software
prefetching can determine non-unit strides as easily as it can determine unit strides. Assume
prefetches write directly into the cache and that there is no “pollution” (overwriting data that
must be used before the data that are prefetched). For best performance given a non-unit
stride prefetcher, in the steady state of the inner loop how many prefetches must be
outstanding at a given time?
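To illustrate the contrast with hardware prefetching, a software prefetch of the non-unit-stride stream in the unblocked inner loop might be written as sketched below. This is only an assumption-laden example: the loop structure, the GCC/Clang __builtin_prefetch builtin, and the prefetch distance PF_DIST are our choices, and the distance would need tuning on a real machine.

/* Sketch: unblocked transpose with a software prefetch of the stride-n
   (write) stream; PF_DIST is an arbitrary, machine-dependent distance. */
#define PF_DIST 8
void transpose_prefetch(double *dst, const double *src, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (j + PF_DIST < n)
                __builtin_prefetch(&dst[(j + PF_DIST) * n + i], 1, 0); /* rw = 1: write */
            dst[j * n + i] = src[i * n + j];
        }
}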
2.3 [15/20] <2.2> With software prefetching it is important to be careful to have the prefetches
occur in time for use, but also to minimize the number of outstanding prefetches in order to
live within the capabilities of the microarchitecture and to minimize cache pollution. This is
complicated by the fact that different processors have different capabilities and limitations.
a. [15] <2.2> Create a blocked version of the matrix transpose with software prefetching (an
illustrative sketch follows part (b) below).
b. [20] <2.2> Estimate and compare the performance of the blocked and unblocked transpose
codes both with and without software prefetching.
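Purely as a starting point for part (a), one possible combination of blocking and software prefetching is sketched below. The prefetch placement (one block ahead along the source row) and the use of the GCC/Clang __builtin_prefetch builtin are our assumptions, not the book's method.

/* Sketch for part (a): b x b blocked transpose that prefetches the source
   row of the next block column before it is needed. Illustrative only. */
void transpose_blocked_pf(double *dst, const double *src, int n, int b)
{
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int i = ii; i < ii + b && i < n; i++) {
                if (jj + b < n)
                    __builtin_prefetch(&src[i * n + jj + b], 0, 0); /* next block's row */
                for (int j = jj; j < jj + b && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
            }
}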
Case Study 2: Putting It All Together: Highly Parallel Memory Systems
Concept illustrated by this case study
■ Crosscutting Issues: The Design of Memory Hierarchies
The program in Figure 2.29 can be used to evaluate the behavior of a memory system. The
key is having accurate timing and then having the program stride through memory to invoke
different levels of the hierarchy. Figure 2.29 shows the code in C. The first part is a procedure
that uses a standard utility to get an accurate measure of the user CPU time; this procedure
may have to be changed to work on some systems. The second part is a nested loop to read
and write memory at different strides and cache sizes. To get accurate cache timing, this code
is repeated many times. The third part times the nested loop overhead only so that it can be
subtracted from overall measured times to see how long the accesses were. The results are
output in .csv file format to facilitate importing into spreadsheets. You may need to change
CACHE_MAX depending on the question you are answering and the size of memory on the system
you are measuring. Running the program in single-user mode or at least without other active
applications will give more consistent results. The code in Figure 2.29 was derived from a
program written by Andrea Dusseau at the University of California-Berkeley and was based on
a detailed description found in Saavedra-Barrera [1992]. It has been modified to fix a number
of issues with more modern machines and to run under Microsoft Visual C++. It can be
downloaded from www.hpl.hp.com/research/cacti/aca_ch2_cs2.c.
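The actual program should be taken from Figure 2.29 or the URL above. Purely to illustrate the three-part structure described in this paragraph, a heavily simplified sketch of a similar strided-access benchmark is shown below; it uses the portable clock() timer rather than a user-CPU-time utility, omits the loop-overhead measurement, and all names and constants are our own.

/* Simplified sketch of the kind of benchmark described above (not the
   program in Figure 2.29): time strided read-modify-write accesses over
   arrays of increasing size and print CSV rows of ns per access. */
#include <stdio.h>
#include <time.h>

#define CACHE_MIN (1024)              /* smallest array size, in ints */
#define CACHE_MAX (16 * 1024 * 1024)  /* largest array; adjust to your memory */

static int x[CACHE_MAX];

int main(void)
{
    for (int size = CACHE_MIN; size <= CACHE_MAX; size *= 2) {
        for (int stride = 1; stride <= size / 2; stride *= 2) {
            clock_t start = clock();
            int passes = 0;
            do {                            /* repeat to get a measurable time */
                for (int i = 0; i < size; i += stride)
                    x[i] += 1;              /* read and write one element */
                passes++;
            } while (clock() - start < CLOCKS_PER_SEC / 10);
            double sec = (double)(clock() - start) / CLOCKS_PER_SEC;
            double accesses = (double)passes * (size / stride);
            printf("%d,%d,%.1f\n", size, stride, 1e9 * sec / accesses);
        }
    }
    return 0;
}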