THE MICROARCHITECTURE LEVEL - Structured Computer Organization


For a taken branch, the BTB (Branch Target Buffer) is consulted to determine

the target address. The BTB holds the target address of the branch the last time it

was taken. Most of the time this address is correct (in fact, it is always correct for

branches with a constant displacement). Indirect branches, such as those used by

virtual function calls and C++ switch statements, go to many addresses, and they

may be mispredicted by the BTB.
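
The behavior described above can be illustrated with a small model. This is a minimal sketch, not the Core i7's actual structure: it assumes an unbounded table keyed by branch address (a real BTB is finite and tagged), and the names BranchTargetBuffer and count_mispredictions are hypothetical. It also counts the very first encounter of a branch as a miss, since nothing has been recorded yet.

```python
class BranchTargetBuffer:
    """Toy BTB: remembers where each branch went the last time it was taken."""

    def __init__(self):
        # Assumption: an unbounded dict stands in for the finite hardware table.
        self.last_target = {}  # branch address -> most recent taken target

    def predict(self, branch_addr):
        # Returns None when the branch has never been seen taken.
        return self.last_target.get(branch_addr)

    def update(self, branch_addr, actual_target):
        self.last_target[branch_addr] = actual_target


def count_mispredictions(btb, branch_addr, targets):
    """Feed a sequence of taken-branch targets through the BTB and count misses."""
    misses = 0
    for actual in targets:
        if btb.predict(branch_addr) != actual:
            misses += 1
        btb.update(branch_addr, actual)
    return misses


btb = BranchTargetBuffer()
# Direct branch with a constant displacement: the target never changes.
direct_misses = count_mispredictions(btb, 0x400, [0x480] * 8)
# Indirect branch (for example a virtual call) alternating between two targets.
indirect_misses = count_mispredictions(btb, 0x500, [0x900, 0xA00] * 4)
print(direct_misses, indirect_misses)
```

In this model the direct branch misses only on its first, cold encounter, while the alternating indirect branch is mispredicted every time, because the last target is never the next one.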

The second part of the pipeline, the out-of-order control logic, is fed from the

micro-op cache. As each micro-op comes in from the front end, up to four per

cycle, the allocation/renaming unit logs it in a 168-entry table called the ROB

(ReOrder Buffer). This entry keeps track of the status of the micro-op until it is

retired. The allocation/renaming unit then checks to see if the resources the

micro-op needs are available. If so, the micro-op is enqueued for execution in one of

the scheduler queues. Separate queues are maintained for memory and

nonmemory micro-ops. If a micro-op cannot be executed, it is delayed, but subsequent

micro-ops are processed, leading to out-of-order execution of the micro-ops. This

strategy was designed to keep all the functional units as busy as possible. As many

as 154 instructions can be in flight at any instant, and up to 64 of these can be

loads from memory and up to 36 can be stores into memory.
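
The following is a rough sketch of how a delayed micro-op lets later ones run ahead while the ROB still retires them in program order. It rests on several simplifying assumptions that are not from the text: each micro-op is given a hypothetical ready_cycle standing in for "resources available", execution takes no time, and there is no limit on how many micro-ops execute per cycle. The function name simulate is also hypothetical; only the 168-entry ROB and the four-per-cycle allocation width come from the description above.

```python
from collections import deque


def simulate(micro_ops, rob_size=168, width=4):
    """micro_ops: list of (name, ready_cycle) in program order.

    Returns (execute_order, retire_order).
    """
    pending = deque(micro_ops)
    rob = []  # ROB entries, oldest first
    execute_order, retire_order = [], []
    cycle = 0
    while pending or rob:
        # Allocate up to `width` micro-ops per cycle while the ROB has room.
        for _ in range(width):
            if pending and len(rob) < rob_size:
                name, ready = pending.popleft()
                rob.append({"name": name, "ready": ready, "done": False})
        # Execute any entry that is ready, regardless of program order.
        for entry in rob:
            if not entry["done"] and entry["ready"] <= cycle:
                entry["done"] = True
                execute_order.append(entry["name"])
        # Retire only from the head, so retirement stays in program order.
        while rob and rob[0]["done"]:
            retire_order.append(rob.pop(0)["name"])
        cycle += 1
    return execute_order, retire_order


# B is stalled until cycle 3; C and D are processed ahead of it.
executed, retired = simulate([("A", 0), ("B", 3), ("C", 0), ("D", 1)])
print(executed)
print(retired)
```

Here B executes last even though it is second in program order, yet the retirement order is still A, B, C, D.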

Sometimes a micro-op stalls because it needs to write into a register that is

being read or written by a previous micro-op. These conflicts are called WAR and

WAW dependences, respectively, as we saw earlier. By renaming the target of the

new micro-op to allow it to write its result in one of the 160 scratch registers

instead of in the intended, but still-busy, target, it may be possible to schedule the

micro-op for execution immediately. If no scratch register is available, or the

micro-op has a RAW dependence (which can never be papered over), the allocator

notes the nature of the problem in the ROB entry. When all the required resources

become available later, the micro-op is put into one of the scheduler queues.
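
Renaming can be sketched in a few lines. The code below is an illustrative model under stated assumptions, not the Core i7's allocator: the function rename, the three-operand instruction format, and the scratch-register names p0, p1, ... are all hypothetical, and scratch registers are never returned to the free list. Only the count of 160 scratch registers comes from the text.

```python
def rename(instructions, num_scratch=160):
    """instructions: list of (dest, src1, src2) architectural register names.

    Returns the same instructions with each destination given a fresh
    scratch register and each source redirected to the newest copy.
    """
    free = [f"p{i}" for i in range(num_scratch)]
    mapping = {}  # architectural register -> scratch register with its latest value
    renamed = []
    for dest, *srcs in instructions:
        # Sources follow the current mapping, which preserves RAW dependences.
        new_srcs = [mapping.get(s, s) for s in srcs]
        if not free:
            # Corresponds to the case where the micro-op has to wait.
            raise RuntimeError("no scratch register available")
        # A fresh destination removes WAR and WAW conflicts on `dest`.
        mapping[dest] = free.pop(0)
        renamed.append((mapping[dest], *new_srcs))
    return renamed


program = [
    ("R1", "R2", "R3"),  # writes R1
    ("R4", "R1", "R5"),  # reads R1: RAW dependence on the first instruction
    ("R1", "R6", "R7"),  # writes R1 again: WAW with the first, WAR with the second
]
result = rename(program)
for instruction in result:
    print(instruction)
```

After renaming, the third instruction writes p2 rather than the register the first two are using, so it no longer conflicts with them. The second instruction still reads p0, the result of the first, because a RAW dependence cannot be renamed away.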

The scheduler queues send micro-ops into the six functional units when they

are ready to execute. The functional units are as follows:

1. ALU 1 and the floating-point multiply unit.

2. ALU 2 and the floating-point add/subtract unit.

3. ALU 3, the branch-processing unit, and the floating-point comparison unit.

4. Store instructions.

5. Load instructions 1.

6. Load instructions 2.
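
The list above can be read as a table from operation kinds to functional units. The sketch below encodes it that way and hands out at most one micro-op per unit in a cycle. The kind labels ("alu", "fp_mul", and so on), the UNITS table layout, and the greedy first-fit policy in dispatch are assumptions made for illustration; the text does not say how the real schedulers choose among units.

```python
# Functional units numbered as in the list; kind labels are hypothetical.
UNITS = {
    1: {"alu", "fp_mul"},
    2: {"alu", "fp_add"},
    3: {"alu", "branch", "fp_cmp"},
    4: {"store"},
    5: {"load"},
    6: {"load"},
}


def dispatch(ready_ops):
    """Assign ready micro-ops (given by kind) to units for one cycle.

    Greedy first-fit: each op takes the lowest-numbered free unit that
    supports it. This is simple but not optimal; an "alu" op may occupy
    unit 1 and block a later "fp_mul" even though unit 2 was free.
    """
    busy = set()
    issued, waiting = [], []
    for kind in ready_ops:
        unit = next(
            (u for u, kinds in UNITS.items() if kind in kinds and u not in busy),
            None,
        )
        if unit is None:
            waiting.append(kind)  # no free unit this cycle; try again later
        else:
            busy.add(unit)
            issued.append((kind, unit))
    return issued, waiting


issued, waiting = dispatch(["load", "load", "load", "alu", "fp_mul", "store"])
print(issued)
print(waiting)
```

With only two load units, the third load has to wait a cycle, and the fp_mul waits because the greedy policy already gave unit 1 to the alu op.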

Since the schedulers and the ALUs can process one operation per cycle, a 3-GHz

Core i7 has the scheduler performance to issue 18 billion operations per second

(six functional units, each accepting one operation in each of 3 billion cycles);

however, in reality the processor will never reach this level of throughput. Since

the front end supplies at most four micro-ops per cycle, six micro-ops can only be
