Despite their bent for joule-frugal operation, the ARM A9 cores use a very capable
microarchitecture. Each core can decode and execute up to two instructions per
cycle. As we will learn in Chap. 4, this execution rate represents the maximum
throughput of the microarchitecture. But do not expect a core to execute this many
instructions each cycle. Rather, think of this rate as the manufacturer's guaranteed
maximum performance, a level that the processor will never exceed, no matter
what. In many cycles, fewer than two instructions will execute because of the many
"hazards" that can stall instructions, lowering execution throughput. To address
many of these throughput limiters, the ARM A9 incorporates a powerful
branch predictor, out-of-order instruction scheduling, and a highly optimized memory
system.
The OMAP4430's memory system has two main internal L1 caches for each
ARM A9 processor: a 32-KB cache for instructions and a 32-KB cache for data.
Like the Core i7, it also uses an on-chip level 2 (L2) cache, but unlike the Core i7,
it is a relatively tiny 1 MB in size, and it is shared by both ARM A9 cores. The
caches are fed with dual LPDDR2 low-power DRAM channels. LPDDR2 is
derived from the DDR2 memory interface standard, but changed to require fewer
wires and to operate at more power-efficient voltages. Additionally, the memory
controller incorporates a number of memory-access optimizations, such as tiled
memory prefetching and in-memory rotation support.
While we will discuss caching in detail in Chap. 4, a few words about it here
will be useful. All of main memory is divided up into cache lines (blocks) of 32
bytes. The 1024 most heavily used instruction lines and the 1024 most heavily
used data lines are in the level 1 cache. Cache lines that are heavily used but which
do not fit in the level 1 cache are kept in the level 2 cache. This cache contains
both data lines and instruction lines from both ARM A9 CPUs mixed at random.
The level 2 cache contains the most recently touched 32,768 lines in main memory.
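The line counts quoted above follow directly from the cache sizes and the 32-byte line size. A minimal sketch of the arithmetic is given below; the function names are ours, chosen only for illustration.

```python
# Sizes are taken from the text; the helper names are illustrative only.
LINE_BYTES = 32              # size of one cache line (block)
L1_BYTES = 32 * 1024         # each L1 cache (instruction or data) is 32 KB
L2_BYTES = 1024 * 1024       # the shared L2 cache is 1 MB


def lines_in_cache(cache_bytes, line_bytes=LINE_BYTES):
    """Return how many cache lines fit in a cache of the given size."""
    return cache_bytes // line_bytes


def line_number(address, line_bytes=LINE_BYTES):
    """Return the index of the main-memory line that holds a byte address."""
    return address // line_bytes


if __name__ == "__main__":
    print(lines_in_cache(L1_BYTES))   # lines per L1 cache
    print(lines_in_cache(L2_BYTES))   # lines in the L2 cache
    print(line_number(0x40))          # byte 64 lies in line 2
```

Each L1 cache therefore holds 1024 lines and the L2 cache holds 32,768 lines, matching the figures in the text.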
On a level 1 cache miss, the CPU sends the identifier of the line it is looking
for (Tag address) to the level 2 cache. The reply (Tag data) provides the infor-
mation for the CPU to tell whether the line is in the level 2 cache, and if so, what
state it is in. If the line is cached there, the CPU goes and gets it. Getting a value
out of the level 2 cache takes 19 cycles. This is a long time to wait for data, so
clever programmers will optimize their programs to use less data, making it more
likely to find data in the fast level 1 cache.
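The lookup order just described can be sketched as a toy model. Only the 19-cycle level 2 latency comes from the text; the level 1 and main-memory latencies below are assumptions picked for illustration, and each cache is modeled simply as a set of line numbers, ignoring tags, line states, and replacement.

```python
LINE_BYTES = 32          # from the text
L2_HIT_CYCLES = 19       # from the text
L1_HIT_CYCLES = 1        # assumption, for illustration only
MEMORY_CYCLES = 100      # assumption, for illustration only


def access_cost(address, l1_lines, l2_lines):
    """Return (level, cycles) for a byte address in a toy two-level hierarchy.

    l1_lines and l2_lines are sets of line numbers currently cached.
    """
    line = address // LINE_BYTES
    if line in l1_lines:
        return "L1", L1_HIT_CYCLES
    if line in l2_lines:          # L1 miss, but the L2 lookup finds the line
        return "L2", L2_HIT_CYCLES
    return "memory", MEMORY_CYCLES   # miss in both caches


if __name__ == "__main__":
    l1 = {0, 1}
    l2 = {0, 1, 2, 3}
    for addr in (0, 70, 4096):
        print(addr, access_cost(addr, l1, l2))
```

Even with these made-up numbers, the model shows why keeping the working set small pays off: every access that falls out of the level 1 set costs many times more than one that stays in it.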
If the cache line is not in the level 2 cache, it must be fetched from main mem-
ory via the LPDDR2 memory interface. The OMAP4430 LPDDR2 interface is
implemented on-chip such that LPDDR2 DRAM can be connected directly to the
OMAP4430. To access memory, the CPU must first send the upper portion of the
DRAM address to the DRAM chip, using the 13 address lines. This operation, called
an ACTIVATE, loads an entire row of memory within the DRAM into a row
buffer. Subsequently, the CPU can issue multiple READ or WRITE commands, send-
ing the remainder of the address on the same 13 address lines, and sending (or re-
ceiving) the data for the operation on the 32 data lines.
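The two-step addressing above can be illustrated with a small model. This is a simplified sketch, not a description of the real LPDDR2 protocol: the 13 address lines come from the text, but the 10-bit column width, the flat word address, and the class itself are assumptions made for illustration, and banks, precharge, and timing are ignored.

```python
ADDRESS_LINES = 13     # from the text: row and column share these lines
COLUMN_BITS = 10       # assumption, for illustration only


class DramModel:
    """Toy DRAM with a single row buffer; records the commands it is sent."""

    def __init__(self):
        self.open_row = None    # row currently held in the row buffer
        self.commands = []      # list of (command, value) tuples

    def read(self, address):
        """Issue the commands needed to read one word at a word address."""
        row = (address >> COLUMN_BITS) & ((1 << ADDRESS_LINES) - 1)
        column = address & ((1 << COLUMN_BITS) - 1)
        if row != self.open_row:
            # Upper address bits go out first and load the row buffer.
            self.commands.append(("ACTIVATE", row))
            self.open_row = row
        # Remaining bits select the word within the open row.
        self.commands.append(("READ", column))
        return row, column


if __name__ == "__main__":
    dram = DramModel()
    dram.read(0x1405)
    dram.read(0x1406)   # same row, so no second ACTIVATE is needed
    print(dram.commands)
```

Two reads that fall in the same row need only one ACTIVATE between them, which is the point of loading a whole row into the row buffer.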