Review of Memory Hierarchy - Computer Architecture: A Quantitative Approach


The cache index selects the tag to be tested to see if the desired block is in the cache. The size

of the index depends on cache size, block size, and set associativity. For the Opteron cache the

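set associativity is set to two, and we calculate the index as follows:

$$2^{\text{index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}} = \frac{65{,}536}{64 \times 2} = 512 = 2^{9}$$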

Hence, the index is 9 bits wide, and the tag is 34 - 9 or 25 bits wide. Although that is the

index needed to select the proper block, 64 bytes is much more than the processor wants to

consume at once. Hence, it makes more sense to organize the data portion of the cache memory 8

bytes wide, which is the natural data word of the 64-bit Opteron processor. Thus, in addition

to 9 bits to index the proper cache block, 3 more bits from the block offset are used to index

the proper 8 bytes. Index selection is step 2 in Figure B.5.
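
The field widths described above can be checked with a short sketch. It assumes a 64 KB cache (the size implied by a 9-bit index, 64-byte blocks, and two ways) and a 40-bit physical address, which the passage does not state directly but which matches its 34-bit block address plus a 6-bit block offset. The name `split_address` is hypothetical and used only for illustration.

```python
# Field widths for a two-way set associative cache with 64-byte blocks.
# CACHE_SIZE and PHYSICAL_ADDRESS_BITS are assumptions; the text only
# gives the 9-bit index and the 34-bit block address.
CACHE_SIZE = 64 * 1024
BLOCK_SIZE = 64
ASSOCIATIVITY = 2
PHYSICAL_ADDRESS_BITS = 40

OFFSET_BITS = (BLOCK_SIZE - 1).bit_length()             # 6 bits of block offset
NUM_SETS = CACHE_SIZE // (BLOCK_SIZE * ASSOCIATIVITY)   # 512 sets
INDEX_BITS = (NUM_SETS - 1).bit_length()                # 9 bits of index
TAG_BITS = PHYSICAL_ADDRESS_BITS - OFFSET_BITS - INDEX_BITS  # 25 bits of tag


def split_address(addr):
    """Return (tag, index, word_select, byte_in_word) for a physical address."""
    offset = addr & (BLOCK_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    word_select = offset >> 3    # upper 3 offset bits pick one of eight 8-byte words
    byte_in_word = offset & 0x7  # lower 3 offset bits pick a byte within that word
    return tag, index, word_select, byte_in_word
```

The 9 index bits together with the 3 `word_select` bits form the 12 bits that address the 8-byte-wide data array.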

After reading the two tags from the cache, they are compared to the tag portion of the block

address from the processor. This comparison is step 3 in the figure. To be sure the tag contains

valid information, the valid bit must be set or else the results of the comparison are ignored.

Assuming one tag does match, the final step is to signal the processor to load the proper

data from the cache by using the winning input from a 2:1 multiplexor. The Opteron allows 2

clock cycles for these four steps, so the instructions in the following 2 clock cycles would wait

if they tried to use the result of the load.
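
As a rough software analogy for steps 3 and 4, the sketch below compares the tags of both ways in a set, ignores any match whose valid bit is clear, and uses the matching way to select the data. The `Line` class and `lookup` function are hypothetical names; real hardware performs both comparisons in parallel rather than in a loop.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One way of a set: valid bit, address tag, and a 64-byte data block."""
    valid: bool = False
    tag: int = 0
    data: bytes = bytes(64)


def lookup(cache_set, tag):
    """Return (way, data) for the way that hits, or None on a miss."""
    # Step 3: a tag match counts only when the valid bit is set.
    hits = [line.valid and line.tag == tag for line in cache_set]
    # Step 4: the hit signals act as the select input of the 2:1 multiplexor.
    for way, hit in enumerate(hits):
        if hit:
            return way, cache_set[way].data
    return None
```

A stale tag in an invalid line therefore never produces a hit, which is the purpose of the valid bit check.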

Handling writes is more complicated than handling reads in the Opteron, as it is in any

cache. If the word to be written is in the cache, the first three steps are the same. Since the

Opteron executes out of order, only after it signals that the instruction has committed and the

cache tag comparison indicates a hit are the data written to the cache.

So far we have assumed the common case of a cache hit. What happens on a miss? On a read

miss, the cache sends a signal to the processor telling it the data are not yet available, and 64

bytes are read from the next level of the hierarchy. The latency is 7 clock cycles to the first 8

bytes of the block, and then 2 clock cycles per 8 bytes for the rest of the block. Since the data

cache is set associative, there is a choice on which block to replace. Opteron uses LRU, which

selects the block that was referenced longest ago, so every access must update the LRU bit.

Replacing a block means updating the data, the address tag, the valid bit, and the LRU bit.
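
Two points in this paragraph lend themselves to a small sketch: the arithmetic for filling a whole block, and how a single LRU bit suffices for a two-way set. The function and class names are hypothetical, and the model tracks only tags, not data.

```python
def miss_latency_cycles(block_bytes=64, first_cycles=7,
                        cycles_per_chunk=2, chunk_bytes=8):
    """Cycles to fetch a whole block: 7 for the first 8 bytes, 2 per 8 bytes after."""
    chunks = block_bytes // chunk_bytes
    return first_cycles + (chunks - 1) * cycles_per_chunk


class TwoWaySet:
    """Tag-only model of one two-way set with a single LRU bit."""

    def __init__(self):
        self.tags = [None, None]  # None marks an invalid way
        self.lru_way = 0          # the way that was referenced longest ago

    def access(self, tag):
        """Return True on a hit; on a miss, replace the LRU way."""
        if tag in self.tags:
            way = self.tags.index(tag)
            hit = True
        else:
            way = self.lru_way
            self.tags[way] = tag
            hit = False
        # Every access updates the LRU bit: the other way is now the older one.
        self.lru_way = 1 - way
        return hit
```

With the figures quoted above, `miss_latency_cycles()` evaluates to 7 + 7 × 2 = 21 clock cycles for the full 64-byte block.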

Since the Opteron uses write-back, the old data block could have been modified, and hence

it cannot simply be discarded. The Opteron keeps 1 dirty bit per block to record if the block

was written. If the “victim” was modified, its data and address are sent to the victim buffer.

(This structure is similar to a write buffer in other computers.) The Opteron has space for eight

victim blocks. In parallel with other cache actions, it writes victim blocks to the next level of

the hierarchy. If the victim buffer is full, the cache must wait.
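
The write-back path can be sketched as follows, assuming an eight-entry buffer as stated above. The first-in, first-out drain order, the dictionary standing in for the next level of the hierarchy, and the names `VictimBuffer` and `evict` are all illustrative assumptions rather than details from the text.

```python
from collections import deque

VICTIM_BUFFER_ENTRIES = 8  # the text gives space for eight victim blocks


class VictimBuffer:
    def __init__(self, capacity=VICTIM_BUFFER_ENTRIES):
        self.capacity = capacity
        self.entries = deque()

    def is_full(self):
        return len(self.entries) >= self.capacity

    def push(self, address, data):
        """Queue a dirty victim; return False if the cache must wait."""
        if self.is_full():
            return False
        self.entries.append((address, data))
        return True

    def drain_one(self, next_level):
        """Write the oldest victim to the next level (FIFO order is assumed)."""
        if self.entries:
            address, data = self.entries.popleft()
            next_level[address] = data


def evict(line, victim_buffer):
    """Return True if the replacement can proceed, False if it must stall."""
    if line["valid"] and line["dirty"]:
        # Modified block: its data and address go to the victim buffer.
        return victim_buffer.push(line["address"], line["data"])
    return True  # clean or invalid blocks can simply be discarded
```

Only dirty blocks consume buffer entries, so a stall requires eight modified victims to be outstanding at once.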

A write miss is very similar to a read miss, since the Opteron allocates a block on a read or

a write miss.

We have seen how it works, but the data cache cannot supply all the memory needs of the

processor: The processor also needs instructions. Although a single cache could try to supply

both, it can be a bottleneck. For example, when a load or store instruction is executed, the

pipelined processor will simultaneously request both a data word and an instruction word.

Hence, a single cache would present a structural hazard for loads and stores, leading to stalls.

One simple way to conquer this problem is to divide it: One cache is dedicated to instructions
