An instruction TLB miss first goes to the L2 TLB, which holds 512 PTEs for 4 KB pages
and is four-way set associative. It takes two clock cycles to load the L1 TLB from the L2
TLB. If the L2 TLB misses, a hardware algorithm is used to walk the page table and update
the TLB entry. In the worst case, the page is not in memory, and the operating system gets the
page from disk. Since millions of instructions could execute during a page fault, the operating
system will swap in another process if one is waiting to run. Otherwise, if there is no TLB
exception, the instruction cache access continues.
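The lookup order described above (L1 TLB, then L2 TLB, then a hardware page walk, then a page fault) can be sketched as a toy model. This is only an illustration: `TinyTLB`, `translate`, and the dictionary-based `page_table` are hypothetical names, capacity and associativity are not modeled, and refilling both TLB levels after a walk is an assumption rather than something the text states.

```python
from dataclasses import dataclass, field

PAGE_OFFSET_BITS = 12  # 4 KB pages, as in the text


@dataclass
class TinyTLB:
    """Toy TLB: maps virtual page number -> physical page frame (no capacity limit)."""
    entries: dict = field(default_factory=dict)

    def lookup(self, vpn):
        return self.entries.get(vpn)  # None signals a miss


def translate(vaddr, l1_tlb, l2_tlb, page_table):
    """Return (physical address, where the translation was found)."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)

    frame = l1_tlb.lookup(vpn)
    source = "L1 TLB"
    if frame is None:
        frame = l2_tlb.lookup(vpn)
        source = "L2 TLB"
        if frame is None:
            # Stand-in for the hardware page walk; a missing entry models
            # the worst case where the OS must bring the page in from disk.
            if vpn not in page_table:
                raise LookupError("page fault: OS must fetch the page from disk")
            frame = page_table[vpn]
            source = "page walk"
            l2_tlb.entries[vpn] = frame  # assumption: the walk refills the L2 TLB
        l1_tlb.entries[vpn] = frame      # load the L1 TLB from the lower level
    return (frame << PAGE_OFFSET_BITS) | offset, source
```

Calling `translate` twice on the same address with initially empty TLBs would resolve the first call through the page walk and the second from the L1 TLB.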
The index field of the address is sent to all four banks of the instruction cache (step 5). The
instruction cache tag is 36 − 7 bits (index) − 6 bits (block offset), or 23 bits. The four tags and
valid bits are compared to the physical page frame from the instruction TLB (step 6). As the
i7 expects 16 bytes each instruction fetch, an additional 2 bits are used from the 6-bit block offset
to select the appropriate 16 bytes. Hence, 7 + 2 or 9 bits are used to send 16 bytes of instructions
to the processor. The L1 cache is pipelined, and the latency of a hit is 4 clock cycles (step
7). A miss goes to the second-level cache.
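The field widths quoted above (23-bit tag, 7-bit index, 6-bit block offset, plus 2 bits to pick a 16-byte chunk) can be checked with a short bit-slicing sketch. The function name `split_l1_address` is hypothetical, and two simplifications are assumed: a single 36-bit value supplies every field (in the real cache the index comes from the virtual address and the tag from the physical frame), and the 16-byte chunks are aligned so that the upper 2 offset bits select among them.

```python
OFFSET_BITS = 6                            # 64-byte block
INDEX_BITS = 7                             # index sent to all four banks
TAG_BITS = 36 - INDEX_BITS - OFFSET_BITS   # 23 bits, as in the text
CHUNK_BITS = 2                             # selects one of four 16-byte chunks


def split_l1_address(addr):
    """Split a 36-bit address into (tag, index, chunk, byte_in_chunk)."""
    assert 0 <= addr < (1 << 36)
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    # Assumption: the upper 2 bits of the block offset pick the 16-byte chunk.
    chunk = offset >> (OFFSET_BITS - CHUNK_BITS)
    byte_in_chunk = offset & ((1 << (OFFSET_BITS - CHUNK_BITS)) - 1)
    return tag, index, chunk, byte_in_chunk
```

The index and chunk fields together are the 7 + 2 = 9 bits that locate the 16 bytes delivered to the processor.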
As mentioned earlier, the instruction cache is virtually addressed and physically tagged. Because
the second-level caches are physically addressed, the physical page address from the
TLB is composed with the page offset to make an address to access the L2 cache. The L2 index
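is
$$2^{\text{Index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}} = \frac{256\text{K}}{64 \times 8} = 512 = 2^{9}$$
so the 30-bit block address (36-bit physical address − 6-bit block offset) is divided into a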
21-bit tag and a 9-bit index (step 8). Once again, the index and tag are sent to all eight banks of
the unified L2 cache (step 9), which are compared in parallel. If one matches and is valid (step
10), it returns the block in sequential order after the initial 10-cycle latency at a rate of 8 bytes
per clock cycle.
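The L2 lookup in steps 8 to 10 can be sketched as a small set-associative model. The names `Line`, `make_cache`, `l2_lookup`, and `transfer_clocks` are hypothetical, the loop stands in for what the hardware does in parallel across the eight ways, and the transfer count is plain arithmetic on the figures above (64-byte block, 8 bytes per clock) rather than a timing model.

```python
from dataclasses import dataclass

OFFSET_BITS, INDEX_BITS, WAYS = 6, 9, 8   # 64-byte blocks, 512 sets, eight ways


@dataclass
class Line:
    valid: bool = False
    tag: int = 0


def make_cache():
    """Empty cache: one list of WAYS lines per set."""
    return [[Line() for _ in range(WAYS)] for _ in range(1 << INDEX_BITS)]


def l2_lookup(cache, paddr):
    """Return the way that hits for a 36-bit physical address, or None on a miss."""
    block_addr = paddr >> OFFSET_BITS            # 30-bit block address
    index = block_addr & ((1 << INDEX_BITS) - 1)  # low 9 bits
    tag = block_addr >> INDEX_BITS                # remaining 21 bits
    for way, line in enumerate(cache[index]):     # hardware checks all ways at once
        if line.valid and line.tag == tag:
            return way
    return None


def transfer_clocks(block_bytes=64, bytes_per_clock=8):
    """Clocks needed to stream one block once the initial latency has passed."""
    return block_bytes // bytes_per_clock
```

With these figures a 64-byte block takes 8 clocks to stream out after the initial 10-cycle latency.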
If the L2 cache misses, the L3 cache is accessed. For a four-core i7, which has an 8 MB L3, the
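index size is
$$2^{\text{Index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}} = \frac{8\text{M}}{64 \times 16} = 8192 = 2^{13}$$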
The 13-bit index (step 11) is sent to all 16 banks of the L3 (step 12). The L3 tag, which is 36
− (13 + 6) = 17 bits, is compared against the physical address from the TLB (step 13). If a hit
occurs, the block is returned after an initial latency at a rate of 16 bytes per clock and placed
into both L1 and L3. If L3 misses, a memory access is initiated.
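As a cross-check on the tag and index widths quoted for each level, the following sketch derives them from cache size, block size, and associativity. The helper `cache_fields` is a hypothetical name; the 32 KB and 256 KB capacities used in the usage note are the sizes implied by the index widths and associativities in the text (128 sets × 4 ways × 64 bytes, and 512 sets × 8 ways × 64 bytes), and a 36-bit physical address is assumed throughout.

```python
def cache_fields(size_bytes, block_bytes, ways, addr_bits=36):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache."""
    sets = size_bytes // (block_bytes * ways)
    index_bits = sets.bit_length() - 1
    offset_bits = block_bytes.bit_length() - 1
    # Both the set count and the block size must be powers of two.
    assert (1 << index_bits) == sets and (1 << offset_bits) == block_bytes
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits
```

For example, `cache_fields(32 * 1024, 64, 4)`, `cache_fields(256 * 1024, 64, 8)`, and `cache_fields(8 * 1024 * 1024, 64, 16)` correspond to the L1 instruction cache, the L2, and the L3 of the four-core part described here.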
If the instruction is not found in the L3 cache, the on-chip memory controller must get the
block from main memory. The i7 has three 64-bit memory channels that can act as one 192-bit
channel, since there is only one memory controller and the same address is sent on all channels
(step 14). Wide transfers happen when all channels have identical DIMMs. Each channel
supports up to four DDR DIMMs (step 15). When the data return, they are placed into L3
and L1 (step 16) because L3 is inclusive.