Hardware Reference
In-Depth Information
the key components are similar to those used in the Core i7. The similarities are
driven mostly by technology, power constraints, and economics. For example,
both designs employ a multilevel cache hierarchy. However, to meet the tight cost
constraints of typical embedded applications, the last level of the Cortex A9's cache
memory system (L2) is only 1 MB in size, significantly smaller than that of the Core i7,
which supports last-level caches (L3) of up to 20 MB. The differences, in contrast,
stem mostly from whether the design has to bridge the gap between an old
CISC instruction set and a modern RISC core: the Core i7 does, and the Cortex A9 does not.
[Figure: block diagram. Labeled blocks: system interface and memory controller (to LPDDR2 memory); level 1 instruction cache; fast-loop look-aside; branch predictor/branch-target-address cache; instruction issue unit/decoder/renamer; level 2 unified cache; instruction queue; level 1 data cache; load-store unit/store buffer; ALUs; FPUs; retirement.]
Figure 4-48. The block diagram of the OMAP4430's Cortex A9 microarchitecture.
At the top of Fig. 4-48 is the 32-KB 4-way associative instruction cache, which
uses 32-byte cache lines. Since most ARM instructions are 4 bytes, there is room
for about 8K instructions in this cache, making it quite a bit larger than the Core i7's
micro-op cache.
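The arithmetic behind these figures can be checked with a short sketch. The cache size, line size, associativity, and instruction size come from the text; the address split into tag, set index, and byte offset assumes a conventional set-associative layout and is not taken from the Cortex A9 documentation. The function names are illustrative only.

```python
# Instruction-cache parameters quoted in the text.
CACHE_BYTES = 32 * 1024   # 32-KB L1 instruction cache
LINE_BYTES = 32           # 32-byte cache lines
WAYS = 4                  # 4-way set associative
INSTR_BYTES = 4           # most ARM instructions are 4 bytes


def icache_geometry():
    """Derive line, set, and instruction counts from the parameters above."""
    lines = CACHE_BYTES // LINE_BYTES
    return {
        "lines": lines,
        "sets": lines // WAYS,
        "instrs_per_line": LINE_BYTES // INSTR_BYTES,
        "instr_capacity": CACHE_BYTES // INSTR_BYTES,
    }


def split_address(addr):
    """Split a byte address into (tag, set_index, offset).

    Assumption: the set index is taken from the address bits just above
    the line offset, as in a textbook set-associative cache.
    """
    sets = icache_geometry()["sets"]
    offset = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % sets
    tag = addr // (LINE_BYTES * sets)
    return tag, set_index, offset


if __name__ == "__main__":
    print(icache_geometry())
    print(split_address(0x12345))
```

Under these assumptions the cache has 1024 lines in 256 sets, each line holds 8 instructions, and the total capacity is 8192 instructions, which matches the "about 8K" figure.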
The instruction issue unit prepares up to four instructions for execution per
clock cycle. If there is a miss on the L1 cache, fewer instructions will be issued.
When a conditional branch is encountered, a branch predictor with 4K entries is
consulted to predict whether or not the branch will be taken. If predicted taken, the
1K-entry branch-target-address cache is consulted for the predicted target address.
In addition, if the front end detects that the program is executing a tight loop (i.e., a
non-nested small loop), it will load it into the fast-loop look-aside cache. This
optimization speeds up instruction fetch and reduces power, since the caches and
branch predictors can be in a low-power sleep mode while the tight loop is executing.
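The two-step lookup described above (direction first, then target) can be illustrated with a toy model. Only the table sizes, 4K predictor entries and 1K branch-target entries, come from the text. The 2-bit saturating counters, the PC-based indexing, and the class and method names are assumptions chosen for illustration; the text does not say how the Cortex A9 predictor is organized internally.

```python
PREDICTOR_ENTRIES = 4 * 1024   # 4K-entry branch predictor (from the text)
BTAC_ENTRIES = 1024            # 1K-entry branch-target-address cache (from the text)


class FrontEndPredictor:
    """Toy model: direction predictor plus branch-target-address cache.

    Assumptions (not from the text): 2-bit saturating counters, tables
    indexed by the word address of the branch, full-PC tags in the BTAC.
    """

    def __init__(self):
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * PREDICTOR_ENTRIES
        # Each BTAC entry is None or a (branch_pc, target) pair.
        self.btac = [None] * BTAC_ENTRIES

    def _predictor_index(self, pc):
        return (pc >> 2) % PREDICTOR_ENTRIES

    def _btac_index(self, pc):
        return (pc >> 2) % BTAC_ENTRIES

    def predict(self, pc):
        """Return (predicted_taken, predicted_target_or_None)."""
        if self.counters[self._predictor_index(pc)] < 2:
            return False, None
        # Only a predicted-taken branch consults the BTAC.
        entry = self.btac[self._btac_index(pc)]
        if entry is not None and entry[0] == pc:
            return True, entry[1]
        return True, None

    def update(self, pc, taken, target=None):
        """Train the tables with the resolved branch outcome."""
        i = self._predictor_index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
            self.btac[self._btac_index(pc)] = (pc, target)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


if __name__ == "__main__":
    p = FrontEndPredictor()
    print(p.predict(0x8000))           # before any training
    p.update(0x8000, True, 0x7FF0)     # loop branch resolved as taken
    print(p.predict(0x8000))           # after one taken outcome
```

In this sketch a branch that has been taken once is predicted taken on its next encounter and the stored target is returned; a real predictor may use history bits or a different update policy.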
The output of the instruction issue unit flows into the decoders, which determine
which resources and inputs are needed by the instructions. Like the Core i7,
 
 