Hardware Reference
In-Depth Information
issued in short bursts, since the scheduler queues will soon empty. Also, the memory
units each take four cycles to process their operations, so they can contribute
to the peak execution throughput only in small bursts. Despite not being able to
fully saturate the execution resources, the functional units do provide a significant
execution capability, and that is why the out-of-order control goes to so much trou-
ble to find work for them to do.
The three integer ALUs are not identical. ALU 1 can perform all arithmetic
and logical operations, as well as multiplies and divides. ALU 2 can perform only
arithmetic and logical operations. ALU 3 can perform arithmetic and logical operations
and resolve branches. Similarly, the two floating-point units are not identical ei-
ther. The first one can perform floating-point arithmetic including multiplies,
while the second one can perform only floating-point adds, subtracts, and moves.
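The division of labor among the functional units can be written down as a small capability table. The sketch below is only an illustration of the description above: the operation-class names ("arith", "mul", "fp_move", and so on) and the helper eligible_units are hypothetical labels chosen for this example, not Intel terminology, and floating-point arithmetic is assumed to mean adds, subtracts, and multiplies.

```python
# Capability table transcribed from the text. The operation-class names are
# illustrative labels invented for this sketch.
FUNCTIONAL_UNITS = {
    "ALU 1": {"arith", "logic", "mul", "div"},
    "ALU 2": {"arith", "logic"},
    "ALU 3": {"arith", "logic", "branch"},
    # Assumption: "floating-point arithmetic including multiplies" is taken
    # to mean adds, subtracts, and multiplies.
    "FP 1": {"fp_add", "fp_sub", "fp_mul"},
    "FP 2": {"fp_add", "fp_sub", "fp_move"},
}


def eligible_units(op_class):
    """Return the names of all units able to execute the given operation class."""
    return sorted(
        name for name, ops in FUNCTIONAL_UNITS.items() if op_class in ops
    )


if __name__ == "__main__":
    for op in ("arith", "mul", "branch", "fp_mul", "fp_add"):
        print(op, "->", eligible_units(op))
```

A scheduler built on such a table has three choices for a plain add but only one for a multiply or a branch, which is one reason the mix of ready micro-ops matters for throughput.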
The ALU and floating-point units are fed by a pair of 128-entry register files,
one for integers and one for floating-point numbers. These provide all the oper-
ands for the instructions to be executed and provide a repository for results. Due to
the register renaming, eight of them contain the registers visible at the ISA level
(EAX, EBX, ECX, EDX, etc.), but which eight hold the "real" values varies over time
as the mapping changes during execution.
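The idea that the eight architectural registers "move around" inside a 128-entry file can be made concrete with a toy rename table. This is a minimal sketch under simplifying assumptions, not a description of the actual Sandy Bridge logic: the class RenameTable, its method names, and the first-in-first-out free list are all hypothetical, and the point at which hardware reclaims an old physical register is not modeled.

```python
class RenameTable:
    """Toy model: maps ISA-visible registers onto a larger physical file."""

    def __init__(self, arch_regs, num_physical=128):
        # Initially, architectural register i lives in physical register i.
        self.mapping = {name: i for i, name in enumerate(arch_regs)}
        # All remaining physical registers are free for renaming.
        self.free = list(range(len(arch_regs), num_physical))

    def rename_dest(self, arch_reg):
        """Allocate a fresh physical register for a write to arch_reg.

        Returns (new_physical, old_physical). The old register must be
        released by the caller once no in-flight instruction needs it.
        """
        new_phys = self.free.pop(0)
        old_phys = self.mapping[arch_reg]
        self.mapping[arch_reg] = new_phys
        return new_phys, old_phys

    def release(self, phys):
        """Return a physical register to the free list."""
        self.free.append(phys)


if __name__ == "__main__":
    regs = ["EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "EBP", "ESP"]
    table = RenameTable(regs)
    print(table.rename_dest("EAX"))  # EAX now lives in a different register
    print(table.mapping)
```

At every instant exactly eight physical registers hold the ISA-visible values, but each write to an architectural register moves it to a different slot.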
The Sandy Bridge architecture introduced the Advanced Vector Extensions
(AVX), which supports 256-bit data-parallel vector operations. The vector operations
include both floating-point and integer vectors, and this new ISA extension
represents a two-times increase in the size of vectors now supported compared to
the previous SSE and SSE2 ISA extensions. How does the architecture implement
256-bit operations with only 128-bit data paths and functional units? It cleverly
coordinates two 128-bit scheduler ports to produce a single 256-bit functional unit.
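The data-path side of this trick can be illustrated in software. The following sketch shows only how a 256-bit vector add decomposes into two independent 128-bit halves; it says nothing about how the scheduler ports are actually coordinated. The function names add_128 and add_256 and the choice of 32-bit integer lanes are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF  # each lane is a 32-bit unsigned integer


def add_128(a, b):
    """Add two 128-bit vectors, each given as four 32-bit lanes."""
    assert len(a) == len(b) == 4
    # Lane-wise addition with wraparound, as in packed integer adds.
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def add_256(a, b):
    """Add two 256-bit vectors (eight 32-bit lanes) using two 128-bit adds."""
    assert len(a) == len(b) == 8
    low = add_128(a[:4], b[:4])    # lower 128 bits
    high = add_128(a[4:], b[4:])   # upper 128 bits
    return low + high              # concatenate the two halves


if __name__ == "__main__":
    print(add_256(list(range(8)), [10] * 8))
```

Because no lane depends on any other, the two halves can be computed at the same time by separate 128-bit units and the results simply placed side by side.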
The L1 data cache is tightly coupled into the back end of the Sandy Bridge
pipeline. It is a 32-KB cache and holds integers, floating-point numbers, and other
kinds of data. Unlike the micro-op cache, it is not decoded in any way. It just
holds a copy of the bytes in memory. The L1 data cache is an 8-way associative
cache with 64 bytes per cache line. It is a write-back cache, meaning that when a
cache line is modified, that line's dirty bit is set and the data are copied back to the
L2 cache when evicted from the L1 data cache. The cache can handle two read
and one write operation per clock cycle. These multiple accesses are implemented
using banking, which splits the cache into separate subcaches (8 in the Sandy
Bridge case). As long as all three accesses are to separate banks, they can proceed
in tandem; otherwise, all but one of the conflicting bank accesses will have to
stall. When a needed word is not present in the L1 cache, a request is sent to the
L2 cache, which either responds immediately or fetches the cache line from the
shared L3 cache and then responds. Up to ten requests from the L1 cache to the L2
cache can be in progress at any instant.
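The cache parameters given above determine its geometry: 32 KB divided by 8 ways of 64 bytes each gives 64 sets, so an address splits into a 6-bit line offset, a 6-bit set index, and a tag. The sketch below works this out and also models the bank-conflict rule. Note that the text does not say how addresses are assigned to banks; the 8-byte interleaving (BANK_WIDTH) used here is purely a hypothetical choice for illustration, as are the function names.

```python
CACHE_BYTES = 32 * 1024
WAYS = 8
LINE_BYTES = 64
NUM_BANKS = 8
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # 32 KB / (8 * 64 B) = 64 sets

# Hypothetical: assume consecutive 8-byte chunks go to consecutive banks.
# The real address-to-bank mapping is not given in the text.
BANK_WIDTH = 8


def decompose(addr):
    """Split an address into (tag, set_index, line_offset)."""
    offset = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, set_index, offset


def bank_of(addr):
    """Bank number under the assumed 8-byte interleaving."""
    return (addr // BANK_WIDTH) % NUM_BANKS


def schedule(addresses):
    """Split one cycle's accesses into those that proceed and those that stall.

    The first access to each bank proceeds; later accesses to a bank that is
    already busy stall, matching the rule described in the text.
    """
    proceed, stalled, busy = [], [], set()
    for addr in addresses:
        bank = bank_of(addr)
        if bank in busy:
            stalled.append(addr)
        else:
            busy.add(bank)
            proceed.append(addr)
    return proceed, stalled


if __name__ == "__main__":
    print(NUM_SETS, decompose(0x12345))
    print(schedule([0x00, 0x08, 0x10]))  # three different banks
    print(schedule([0x00, 0x40, 0x08]))  # 0x00 and 0x40 share a bank here
```

Under this assumed mapping, two accesses exactly 64 bytes apart land in the same bank, so one of them must wait a cycle even though they touch different cache lines.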
Because micro-ops are executed out of order, stores into the L1 cache are not
permitted until all instructions preceding a particular store have been retired.
The retirement unit has the job of retiring instructions, in order, and keeping track
 