the instructions are renamed after decode to eliminate WAR hazards that can slow
down out-of-order execution. After renaming, the instructions are placed into the
instruction dispatch queue, which issues them to the functional units when their
inputs are ready, potentially out of order.
The instruction dispatch queue sends instructions to the functional units, as
shown in Fig. 4-48. The integer execution unit contains two ALUs as well as a
short pipeline for branch instructions. The physical register file, which holds the
ISA registers and some scratch registers, is also contained there. The Cortex A9
pipeline can optionally contain one or more compute engines as well, which act
as additional functional units. ARM supports a compute engine for floating-point
computation, called VFP, and one for integer SIMD vector computation, called NEON.
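As an illustration of the kind of work the NEON unit handles, the short C program below adds four pairs of 32-bit integers with a single SIMD operation using the standard NEON intrinsics. It assumes a compiler targeting an ARM core with NEON enabled (for example, gcc with -mfpu=neon on ARMv7).

    /* Illustrative use of the NEON SIMD unit through C intrinsics. */
    #include <arm_neon.h>
    #include <stdio.h>

    int main(void) {
        int32_t a[4] = {1, 2, 3, 4};
        int32_t b[4] = {10, 20, 30, 40};
        int32_t c[4];

        int32x4_t va = vld1q_s32(a);      /* load four 32-bit integers      */
        int32x4_t vb = vld1q_s32(b);
        int32x4_t vc = vaddq_s32(va, vb); /* one operation adds all four    */
        vst1q_s32(c, vc);                 /* store the four results         */

        printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
        return 0;
    }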
The load/store unit handles various load and store instructions. It has paths to
the data cache and the store buffer. The data cache is a traditional 32-KB 4-way
associative L1 data cache using a 32-byte line size. The store buffer holds the
stores that have not yet written their value to the data cache (at retirement). A load
that executes will first try to fetch its value from the store buffer, using store-to-
load forwarding like that of the Core i7. If the value is not available in the store
buffer, it will fetch it from the data cache. One possible outcome of a load
executing is an indication from the store buffer that it should wait, because an
earlier store with an unknown address is blocking its execution. In the event that
the L1 data cache access misses, the memory block will be fetched from the unified
L2 cache. Under certain circumstances, the Cortex A9 also performs hardware
prefetching out of the L2 cache into the L1 data cache, in order to improve the
performance of loads and stores.
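The store-buffer behavior just described can be sketched as a toy model in C. The buffer size, the entry layout, and the assumption that every buffered store is older than the load are simplifications invented here; the real matching and forwarding logic is considerably more involved.

    /* Toy model of store-to-load forwarding. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SB_ENTRIES 8

    struct sb_entry {
        bool     valid;
        bool     addr_known;   /* the store's address may be unresolved */
        uint32_t addr;
        uint32_t value;
    };

    static struct sb_entry store_buffer[SB_ENTRIES];   /* index 0 = oldest */

    enum load_result { LOAD_FORWARDED, LOAD_FROM_CACHE, LOAD_MUST_WAIT };

    /* Search the buffer from youngest to oldest store (all assumed older
     * than the load). */
    static enum load_result try_load(uint32_t addr, uint32_t *value)
    {
        for (int i = SB_ENTRIES - 1; i >= 0; i--) {
            if (!store_buffer[i].valid)
                continue;
            if (!store_buffer[i].addr_known)
                return LOAD_MUST_WAIT;          /* unknown address blocks the load */
            if (store_buffer[i].addr == addr) {
                *value = store_buffer[i].value; /* forward the pending store's data */
                return LOAD_FORWARDED;
            }
        }
        return LOAD_FROM_CACHE;                 /* no match: go to the L1 data cache */
    }

    int main(void)
    {
        uint32_t v;
        store_buffer[0] = (struct sb_entry){ true, true, 0x1000, 42 };

        switch (try_load(0x1000, &v)) {
        case LOAD_FORWARDED:  printf("forwarded %u\n", v); break;
        case LOAD_FROM_CACHE: printf("read from cache\n"); break;
        case LOAD_MUST_WAIT:  printf("load must wait\n");  break;
        }
        return 0;
    }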
The OMAP4430 chip also contains logic for controlling memory access. This
logic is split into two parts: the system interface and the memory controller. The
system interface communicates with external memory over a 32-bit-wide LPDDR2 bus. All
memory requests to the outside world pass through this interface. The LPDDR2
bus supports a 26-bit (word, not byte) address to 8 banks that return a 32-bit data
word. In theory, the main memory can be up to 2 GB per LPDDR2 channel. The
OMAP4430 has two of them, so it can address up to 4 GB of external RAM.
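The 2-GB figure follows from the bus parameters quoted above, under the reading that the 26-bit word address applies within each of the 8 banks: 2^26 word addresses times 8 banks times 4 bytes per word gives 2^31 bytes per channel. The short C program below simply checks that arithmetic.

    /* Quick check of the per-channel capacity implied by the bus parameters. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t word_addresses = 1ULL << 26;  /* 26-bit word address */
        uint64_t banks          = 8;           /* 8 banks per channel */
        uint64_t bytes_per_word = 4;           /* 32-bit data word    */

        uint64_t per_channel = word_addresses * banks * bytes_per_word;
        printf("per channel: %llu bytes (%llu GB)\n",
               (unsigned long long)per_channel,
               (unsigned long long)(per_channel >> 30));        /* 2 GB */
        printf("two channels: %llu GB\n",
               (unsigned long long)((2 * per_channel) >> 30));  /* 4 GB */
        return 0;
    }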
The memory controller maps 32-bit virtual addresses onto 32-bit physical
addresses. The Cortex A9 supports virtual memory (discussed in Chap. 6), with a
4-KB page size. To speed up the mapping, special tables, called TLBs (Translation
Lookaside Buffers), are provided to compare the current virtual address being
referenced to those referenced in the recent past. Two such tables are provided,
one for mapping instruction addresses and one for data addresses.
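As a sketch of the translation the TLBs accelerate, the C fragment below splits a 32-bit virtual address into a 20-bit virtual page number and a 12-bit offset (for 4-KB pages) and looks the page number up in a small fully associative table. The table size and entry layout are hypothetical and do not reflect the Cortex A9's actual TLB organization.

    /* Sketch of virtual-to-physical translation with 4-KB pages and a tiny TLB. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                    /* 4-KB pages: 2^12 = 4096 bytes */
    #define PAGE_OFFSET_MASK 0xFFFu
    #define TLB_ENTRIES 32                   /* hypothetical size             */

    struct tlb_entry {
        bool     valid;
        uint32_t vpn;                        /* virtual page number           */
        uint32_t pfn;                        /* physical frame number         */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Return true and fill *phys on a TLB hit; a miss would trigger a page
     * table walk (not shown). */
    static bool tlb_translate(uint32_t vaddr, uint32_t *phys)
    {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;
        uint32_t offset = vaddr & PAGE_OFFSET_MASK;

        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *phys = (tlb[i].pfn << PAGE_SHIFT) | offset;
                return true;
            }
        }
        return false;                        /* TLB miss */
    }

    int main(void)
    {
        tlb[0] = (struct tlb_entry){ true, 0x00012, 0x00345 };

        uint32_t phys;
        if (tlb_translate(0x00012ABC, &phys))
            printf("virtual 0x00012ABC -> physical 0x%08X\n", phys);
        return 0;
    }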
The OMAP4430's Cortex A9 Pipeline
The Cortex A9 has an 11-stage pipeline, illustrated in simplified form in
Fig. 4-49. The 11 stages are designated by short stage names shown on the
left-hand side of the figure. Let us now briefly examine each stage. The Fe1 (Fetch
 