THE MICROARCHITECTURE LEVEL - Structured Computer Organization


issued in short bursts since soon the scheduler queues will empty. Also, the memory units each take four cycles to process their operations, so they can contribute to the peak execution throughput only in small bursts. Despite not being able to fully saturate the execution resources, the functional units do provide a significant execution capability, and that is why the out-of-order control goes to so much trouble to find work for them to do.

The three integer ALUs are not identical. ALU 1 can perform all arithmetic and logical operations as well as multiplies and divides. ALU 2 can perform only arithmetic and logical operations. ALU 3 can perform arithmetic and logical operations and resolve branches. Similarly, the two floating-point units are not identical either. The first one can perform floating-point arithmetic including multiplies, while the second one can perform only floating-point adds, subtracts, and moves.
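The asymmetry among the units can be pictured as a capability table that the scheduler consults when choosing where to send a micro-op. The following is a minimal sketch, not Intel's dispatch logic: the operation-class names ("arith", "fmul", and so on) and the function eligible_units are illustrative assumptions, and only the capabilities named in the text are listed.

```python
# Capabilities as described in the text. The operation-class labels are
# illustrative assumptions, not real micro-op encodings.
INTEGER_ALUS = {
    "ALU 1": {"arith", "logic", "mul", "div"},
    "ALU 2": {"arith", "logic"},
    "ALU 3": {"arith", "logic", "branch"},
}

# The text says only that the first FP unit does arithmetic "including
# multiplies"; nothing beyond add, subtract, and multiply is assumed here.
FP_UNITS = {
    "FP 1": {"fadd", "fsub", "fmul"},
    "FP 2": {"fadd", "fsub", "fmove"},
}


def eligible_units(op_class):
    """Return the names of all units able to execute the given operation class."""
    units = {**INTEGER_ALUS, **FP_UNITS}
    return sorted(name for name, ops in units.items() if op_class in ops)


if __name__ == "__main__":
    for op in ("arith", "mul", "branch", "fadd", "fmul"):
        print(op, "->", eligible_units(op))
```

Operations that only one unit supports, such as integer multiplies or branches, are the ones most likely to queue up behind each other, which is one reason the peak throughput is reached only in bursts.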

The ALUs and floating-point units are fed by a pair of 128-entry register files, one for integers and one for floating-point numbers. These provide all the operands for the instructions to be executed and provide a repository for results. Because of register renaming, eight of them contain the registers visible at the ISA level (EAX, EBX, ECX, EDX, etc.), but which eight hold the "real" values varies over time as the mapping changes during execution.
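To see why the eight "real" registers move around, consider a toy rename table that maps the eight ISA-visible integer registers onto a 128-entry physical register file. This is a simplified sketch under stated assumptions: the class RenameTable and its methods are hypothetical, and the point at which an old physical register is freed is left to the caller (a real design would free it only once the overwriting instruction retires).

```python
from collections import deque

# The eight 32-bit x86 integer registers visible at the ISA level.
ARCH_REGS = ["EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "ESP", "EBP"]


class RenameTable:
    """Toy model: maps architectural registers to physical register numbers."""

    def __init__(self, arch_regs, num_physical=128):
        self.free = deque(range(num_physical))
        # Initially each architectural register gets one physical register.
        self.map = {reg: self.free.popleft() for reg in arch_regs}

    def rename_dest(self, reg):
        """Give a new write to `reg` a fresh physical register.

        Returns (old_physical, new_physical). The old one still holds the
        previous value for any earlier instructions that have yet to read it.
        """
        old = self.map[reg]
        new = self.free.popleft()
        self.map[reg] = new
        return old, new

    def release(self, physical):
        """Return a physical register to the free list (assumed: at retirement)."""
        self.free.append(physical)


if __name__ == "__main__":
    table = RenameTable(ARCH_REGS)
    print("EAX starts in physical register", table.map["EAX"])
    old, new = table.rename_dest("EAX")
    print("after a write, EAX lives in", new, "and", old, "holds the old value")
```

At any instant exactly eight physical registers are named by the map, but each write to an architectural register moves that name to a different entry of the file.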

The Sandy Bridge architecture introduced the Advanced Vector Extensions (AVX), which support 256-bit data-parallel vector operations. The vector operations include both floating-point and integer vectors, and this new ISA extension represents a two-times increase in the size of vectors now supported compared to the previous SSE and SSE2 ISA extensions. How does the architecture implement 256-bit operations with only 128-bit data paths and functional units? It cleverly coordinates two 128-bit scheduler ports to produce a single 256-bit functional unit.
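The idea of building one 256-bit operation out of two 128-bit units can be illustrated with a lane-wise add. The sketch below models only the data split: a 256-bit vector is treated as eight single-precision lanes, and each half is handed to a function standing in for one 128-bit unit. The function names add_128 and add_256 are hypothetical, and in hardware the two halves would be processed at the same time rather than one after the other.

```python
LANE_BITS = 32                      # one single-precision float per lane
LANES_128 = 128 // LANE_BITS        # 4 lanes fit in a 128-bit unit


def add_128(a, b):
    """Stand-in for one 128-bit functional unit: adds four lanes."""
    if len(a) != LANES_128 or len(b) != LANES_128:
        raise ValueError("a 128-bit unit handles exactly 4 lanes")
    return [x + y for x, y in zip(a, b)]


def add_256(a, b):
    """Model a 256-bit add as two coordinated 128-bit adds."""
    if len(a) != 2 * LANES_128 or len(b) != 2 * LANES_128:
        raise ValueError("a 256-bit vector has exactly 8 lanes")
    low = add_128(a[:LANES_128], b[:LANES_128])     # lower 128 bits
    high = add_128(a[LANES_128:], b[LANES_128:])    # upper 128 bits
    return low + high


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    b = [10.0] * 8
    print(add_256(a, b))
```

Because lane-wise operations have no dependences between lanes, the split requires no communication between the two halves, which is what makes pairing two narrower units practical.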

The L1 data cache is tightly coupled into the back end of the Sandy Bridge pipeline. It is a 32-KB cache and holds integers, floating-point numbers, and other kinds of data. Unlike the micro-op cache, it is not decoded in any way. It just holds a copy of the bytes in memory. The L1 data cache is an 8-way associative cache with 64 bytes per cache line. It is a write-back cache, meaning that when a cache line is modified, that line's dirty bit is set and the data are copied back to the L2 cache when evicted from the L1 data cache. The cache can handle two reads and one write per clock cycle. These multiple accesses are implemented using banking, which splits the cache into separate subcaches (8 in the Sandy Bridge case). As long as all three accesses are to separate banks, they can proceed in tandem; otherwise, all but one of the conflicting bank accesses will have to stall. When a needed word is not present in the L1 cache, a request is sent to the L2 cache, which either responds immediately or fetches the cache line from the shared L3 cache and then responds. Up to ten requests from the L1 cache to the L2 cache can be in progress at any instant.
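The cache parameters above fix its geometry: 32 KB divided by 8 ways of 64-byte lines gives 64 sets, so an address splits into a 6-bit line offset, a 6-bit set index, and a tag. The sketch below works through that arithmetic and adds a simple bank-conflict check. The text does not say which address bits select a bank, so the mapping used here (consecutive 8-byte chunks interleaved across the 8 banks) is purely an assumption for illustration, as are the function names.

```python
CACHE_BYTES = 32 * 1024   # from the text
WAYS = 8                  # from the text
LINE_BYTES = 64           # from the text
BANKS = 8                 # from the text
BANK_WIDTH = 8            # ASSUMPTION: bytes per bank slice, not given in the text

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 32768 / 512 = 64 sets


def split_address(addr):
    """Split a byte address into (tag, set_index, line_offset)."""
    offset = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, set_index, offset


def bank_of(addr):
    """ASSUMED mapping: 8-byte chunks are interleaved across the banks."""
    return (addr // BANK_WIDTH) % BANKS


def accesses_conflict(addrs):
    """True if any two accesses issued in the same cycle map to the same bank."""
    banks = [bank_of(a) for a in addrs]
    return len(set(banks)) < len(banks)


if __name__ == "__main__":
    print("sets:", NUM_SETS)
    print("0x12345 ->", split_address(0x12345))
    print("conflict [0, 8, 16]:", accesses_conflict([0, 8, 16]))
    print("conflict [0, 64, 8]:", accesses_conflict([0, 64, 8]))
```

Under the assumed mapping, addresses 0 and 64 fall in the same bank, so two of the three accesses in the second example could not proceed together. This toy check also flags two accesses to the same address, a case real hardware may handle differently.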

Because micro-ops are executed out of order, stores into the L1 cache are not permitted until all instructions preceding a particular store have been retired. The retirement unit has the job of retiring instructions, in order, and keeping track
