THE MICROARCHITECTURE LEVEL - Structured Computer Organization - page 330

the key components are similar to those used in the Core i7. The similarities are
driven mostly by technology, power constraints, and economics. For example,
both designs employ a multilevel cache hierarchy. To meet the tight cost constraints
of typical embedded applications, however, the last level of the Cortex A9's cache
memory system (L2) is only 1 MB in size, significantly smaller than that of the Core i7,
which supports last-level caches (L3) of up to 20 MB. The differences, in contrast,
stem mostly from whether the design has to bridge the gap between an old
CISC instruction set and a modern RISC core: the Core i7 does, and the Cortex A9 does not.

[Figure: block diagram. Blocks shown: Level 1 instruction cache; fast-loop look-aside; branch predictor / branch-target-address cache; instruction issue unit / decoder / renamer; instruction queue; ALUs; FPUs; load-store unit / store buffer; Level 1 data cache; Level 2 unified cache; system interface; memory controller (to LPDDR2 memory); retirement.]

Figure 4-48. The block diagram of the OMAP4430's Cortex A9 microarchitecture.

At the top of Fig. 4-48 is the 32-KB, 4-way set-associative instruction cache, which
uses 32-byte cache lines. Since most ARM instructions are 4 bytes, the cache has room
for about 8K instructions, quite a bit more than the Core i7's micro-op cache holds.
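The capacity figure above follows directly from the cache parameters. The short Python sketch below reproduces the arithmetic; the function and variable names are illustrative only, and the set count is derived on the assumption of a conventional set-associative organization.

```python
# Worked arithmetic for the L1 instruction cache figures quoted in the text.
CACHE_BYTES = 32 * 1024   # 32-KB instruction cache
LINE_BYTES = 32           # 32-byte cache lines
WAYS = 4                  # 4-way set-associative
INSTR_BYTES = 4           # most ARM instructions are 4 bytes


def cache_geometry(cache_bytes, line_bytes, ways, instr_bytes):
    """Derive line count, set count, and instruction capacity of a cache."""
    lines = cache_bytes // line_bytes   # total number of cache lines
    sets = lines // ways                # each set holds `ways` lines
    return {
        "lines": lines,
        "sets": sets,
        "instructions_per_line": line_bytes // instr_bytes,
        "instruction_capacity": cache_bytes // instr_bytes,
    }


geometry = cache_geometry(CACHE_BYTES, LINE_BYTES, WAYS, INSTR_BYTES)
print(geometry)
```

With these parameters the cache has 1024 lines in 256 sets, each line holds 8 instructions, and the total capacity is 8192 (8K) instructions, matching the figure in the text.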

The instruction issue unit prepares up to four instructions for execution per
clock cycle. If there is a miss on the L1 cache, fewer instructions will be issued.
When a conditional branch is encountered, a branch predictor with 4K entries is
consulted to predict whether or not the branch will be taken. If it is predicted taken, the
1K-entry branch-target-address cache is consulted for the predicted target address.
In addition, if the front end detects that the program is executing a tight loop (i.e., a
small non-nested loop), it loads the loop into the fast-loop look-aside cache. This
optimization speeds up instruction fetch and reduces power, since the caches and
branch predictors can be put in a low-power sleep mode while the tight loop is executing.
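The two-step lookup described above (direction first, then target) can be sketched as follows. Only the table sizes come from the text. The 2-bit saturating counters, the direct-mapped branch-target-address cache, and the indexing by program-counter bits are assumptions made for illustration, and the class and method names are hypothetical; the text does not say how the Cortex A9 organizes these structures internally.

```python
PREDICTOR_ENTRIES = 4096   # 4K-entry branch predictor (from the text)
BTB_ENTRIES = 1024         # 1K-entry branch-target-address cache (from the text)


class FrontEndPredictor:
    """Illustrative model only, not the actual Cortex A9 design."""

    def __init__(self):
        # Assumption: 2-bit saturating counters, starting weakly not-taken (1).
        self.counters = [1] * PREDICTOR_ENTRIES
        # Assumption: direct-mapped BTB holding (branch_pc, target) pairs.
        self.btb = [None] * BTB_ENTRIES

    @staticmethod
    def _index(pc, entries):
        # Instructions are 4 bytes, so the low two PC bits carry no information.
        return (pc >> 2) % entries

    def predict(self, pc):
        """Return (taken, target); target is None if the BTB has no match."""
        taken = self.counters[self._index(pc, PREDICTOR_ENTRIES)] >= 2
        if not taken:
            return False, None
        entry = self.btb[self._index(pc, BTB_ENTRIES)]
        if entry is not None and entry[0] == pc:
            return True, entry[1]
        return True, None

    def update(self, pc, taken, target):
        """Train the tables with the resolved outcome of the branch at pc."""
        i = self._index(pc, PREDICTOR_ENTRIES)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
            self.btb[self._index(pc, BTB_ENTRIES)] = (pc, target)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = FrontEndPredictor()
first = predictor.predict(0x8000)        # untrained: predicted not taken
predictor.update(0x8000, True, 0x7FF0)   # branch resolves as taken
second = predictor.predict(0x8000)       # now predicted taken, target known
```

In this sketch the target cache is consulted only when the direction prediction is "taken", mirroring the order given in the text.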

The output of the instruction issue unit flows into the decoders, which determine
which resources and inputs are needed by the instructions. Like the Core i7,
