the instructions are renamed after decode to eliminate WAR hazards that can slow
down out-of-order execution. After renaming, the instructions are placed into the
instruction dispatch queue, which issues them to the functional units when their
inputs are ready, potentially out of order.
The instruction dispatch queue sends instructions to the functional units, as
shown in Fig. 4-48. The integer execution unit contains two ALUs as well as a
short pipeline for branch instructions. The physical register file, which holds the
ISA registers and some scratch registers, is also contained there. The Cortex A9
pipeline can optionally contain one or more compute engines as well, which act
as additional functional units. ARM supports a compute engine for floating-point
computation, called VFP, and one for integer SIMD vector computation, called NEON.
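As an illustration of the kind of work the NEON unit handles, the short C program below adds four pairs of 32-bit integers with a single SIMD operation using the standard NEON intrinsics. It assumes a compiler targeting an ARM core with NEON enabled (for example, gcc with -mfpu=neon on ARMv7).

    /* Illustrative use of the NEON SIMD unit through C intrinsics. */
    #include <arm_neon.h>
    #include <stdio.h>

    int main(void) {
        int32_t a[4] = {1, 2, 3, 4};
        int32_t b[4] = {10, 20, 30, 40};
        int32_t c[4];

        int32x4_t va = vld1q_s32(a);      /* load four 32-bit integers      */
        int32x4_t vb = vld1q_s32(b);
        int32x4_t vc = vaddq_s32(va, vb); /* one operation adds all four    */
        vst1q_s32(c, vc);                 /* store the four results         */

        printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
        return 0;
    }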
The load/store unit handles various load and store instructions. It has paths to
the data cache and the store buffer. The data cache is a traditional 32-KB 4-way
associative L1 data cache using a 32-byte line size. The store buffer holds the
stores that have not yet written their value to the data cache (at retirement). A load
that executes will first try to fetch its value from the store buffer, using store-to-
load forwarding like that of the Core i7. If the value is not available in the store
buffer, it will fetch it from the data cache. One possible outcome of a load
executing is an indication from the store buffer that it should wait, because an
earlier store with an unknown address is blocking its execution. In the event that
the L1 data cache access misses, the memory block will be fetched from the unified
L2 cache. Under certain circumstances, the Cortex A9 also performs hardware
prefetching out of the L2 cache into the L1 data cache, in order to improve the
performance of loads and stores.
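The store-buffer behavior just described can be sketched as a toy model in C. The buffer size, the entry layout, and the assumption that every buffered store is older than the load are simplifications invented here; the real matching and forwarding logic is considerably more involved.

    /* Toy model of store-to-load forwarding. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SB_ENTRIES 8

    struct sb_entry {
        bool     valid;
        bool     addr_known;   /* the store's address may be unresolved */
        uint32_t addr;
        uint32_t value;
    };

    static struct sb_entry store_buffer[SB_ENTRIES];   /* index 0 = oldest */

    enum load_result { LOAD_FORWARDED, LOAD_FROM_CACHE, LOAD_MUST_WAIT };

    /* Search the buffer from youngest to oldest store (all assumed older
     * than the load). */
    static enum load_result try_load(uint32_t addr, uint32_t *value)
    {
        for (int i = SB_ENTRIES - 1; i >= 0; i--) {
            if (!store_buffer[i].valid)
                continue;
            if (!store_buffer[i].addr_known)
                return LOAD_MUST_WAIT;          /* unknown address blocks the load */
            if (store_buffer[i].addr == addr) {
                *value = store_buffer[i].value; /* forward the pending store's data */
                return LOAD_FORWARDED;
            }
        }
        return LOAD_FROM_CACHE;                 /* no match: go to the L1 data cache */
    }

    int main(void)
    {
        uint32_t v;
        store_buffer[0] = (struct sb_entry){ true, true, 0x1000, 42 };

        switch (try_load(0x1000, &v)) {
        case LOAD_FORWARDED:  printf("forwarded %u\n", v); break;
        case LOAD_FROM_CACHE: printf("read from cache\n"); break;
        case LOAD_MUST_WAIT:  printf("load must wait\n");  break;
        }
        return 0;
    }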
The OMAP4430 chip also contains logic for controlling memory access. This
logic is split into two parts: the system interface and the memory controller. The
system interface communicates with external memory over a 32-bit-wide LPDDR2 bus. All
memory requests to the outside world pass through this interface. The LPDDR2
bus supports a 26-bit (word, not byte) address to 8 banks that return a 32-bit data
word. In theory, the main memory can be up to 2 GB per LPDDR2 channel. The
OMAP4430 has two of them, so it can address up to 4 GB of external RAM.
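The 2-GB figure follows from the bus parameters quoted above, under the reading that the 26-bit word address applies within each of the 8 banks: 2^26 word addresses times 8 banks times 4 bytes per word gives 2^31 bytes per channel. The short C program below simply checks that arithmetic.

    /* Quick check of the per-channel capacity implied by the bus parameters. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t word_addresses = 1ULL << 26;  /* 26-bit word address */
        uint64_t banks          = 8;           /* 8 banks per channel */
        uint64_t bytes_per_word = 4;           /* 32-bit data word    */

        uint64_t per_channel = word_addresses * banks * bytes_per_word;
        printf("per channel: %llu bytes (%llu GB)\n",
               (unsigned long long)per_channel,
               (unsigned long long)(per_channel >> 30));        /* 2 GB */
        printf("two channels: %llu GB\n",
               (unsigned long long)((2 * per_channel) >> 30));  /* 4 GB */
        return 0;
    }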
The memory controller maps 32-bit virtual addresses onto 32-bit physical
addresses. The Cortex A9 supports virtual memory (discussed in Chap. 6), with a
4-KB page size. To speed up the mapping, special tables, called TLBs (Translation
Lookaside Buffers), are provided to compare the current virtual address being
referenced to those referenced in the recent past. Two such tables are provided,
one for mapping instruction addresses and one for data addresses.
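As a sketch of the translation the TLBs accelerate, the C fragment below splits a 32-bit virtual address into a 20-bit virtual page number and a 12-bit offset (for 4-KB pages) and looks the page number up in a small fully associative table. The table size and entry layout are hypothetical and do not reflect the Cortex A9's actual TLB organization.

    /* Sketch of virtual-to-physical translation with 4-KB pages and a tiny TLB. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                    /* 4-KB pages: 2^12 = 4096 bytes */
    #define PAGE_OFFSET_MASK 0xFFFu
    #define TLB_ENTRIES 32                   /* hypothetical size             */

    struct tlb_entry {
        bool     valid;
        uint32_t vpn;                        /* virtual page number           */
        uint32_t pfn;                        /* physical frame number         */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Return true and fill *phys on a TLB hit; a miss would trigger a page
     * table walk (not shown). */
    static bool tlb_translate(uint32_t vaddr, uint32_t *phys)
    {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;
        uint32_t offset = vaddr & PAGE_OFFSET_MASK;

        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *phys = (tlb[i].pfn << PAGE_SHIFT) | offset;
                return true;
            }
        }
        return false;                        /* TLB miss */
    }

    int main(void)
    {
        tlb[0] = (struct tlb_entry){ true, 0x00012, 0x00345 };

        uint32_t phys;
        if (tlb_translate(0x00012ABC, &phys))
            printf("virtual 0x00012ABC -> physical 0x%08X\n", phys);
        return 0;
    }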
The OMAP4430's Cortex A9 Pipeline
The Cortex A9 has an 11-stage pipeline, illustrated in simplified form in
Fig. 4-49. The 11 stages are designated by short stage names shown on the
left-hand side of the figure. Let us now briefly examine each stage. The Fe1 (Fetch
 