Alternative techniques to tackle conflict misses include having separate luma and chroma caches. Similarly, offsetting the memory map so that the same location in successive frames maps to different cache lines can also reduce conflicts. For our chosen configuration, the added complexity of these techniques outweighed the observed hit-rate increases.
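As an illustration of the second idea, the following C sketch shows one way a per-frame offset could be folded into the set-index calculation. The cache geometry (32-byte lines, 128 sets) and the half-set-count offset are hypothetical choices for the sketch, not the configuration evaluated here.

#include <stdint.h>

#define LINE_BYTES 32   /* hypothetical MAU/line size   */
#define NUM_SETS   128  /* hypothetical number of sets  */

/* Map a physical address to a cache set. Adding a per-frame offset
 * shifts the mapping each frame, so the same reference location in
 * successive frames lands in a different set and conflicts less with
 * co-located data still resident from the previous frame. */
static uint32_t set_index(uint32_t addr, uint32_t frame_idx)
{
    uint32_t line = addr / LINE_BYTES;
    return (line + frame_idx * (NUM_SETS / 2)) % NUM_SETS;
}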
10.6.2 Four-Parallel Cache Architecture
This section describes a four-parallel MC cache architecture. Datapath parallelism, together with outstanding-request queues that hide the variable DRAM latency, ensures high throughput. As seen in Fig. 10.17, there are four parallel paths, each outputting up to 32 pixels (1 MAU) per cycle.
10.6.2.1 Four-Parallel Data Flow
The parallelism in the cache datapath allows up to 4 MAUs in a row to be processed simultaneously. The MC cache must fetch at most a 23 × 23 reference region, corresponding to a 16 × 16 PU, which is the largest PU processed by Inter Prediction (see Sect. 10.5.1). This may require up to seven cycles, as shown in Fig. 10.16. The address translation unit in Fig. 10.17 reorders the MAUs based on the lowest 2 bits of the column address. This maps each request to a unique datapath and allows us to split the tag register file and cache SRAM into four smaller pieces. Note that this design cannot output 2 MAUs in the same column on the same cycle. Thus, our design trades unused flexibility in addressing for smaller tag-register and SRAM sizes.
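A minimal C sketch of this column-based routing follows, assuming MAU requests are addressed by (row, column) and that the up-to-four MAUs issued per cycle occupy consecutive columns; the type and function names are illustrative only.

#include <stdint.h>

typedef struct { uint32_t row, col; } mau_req_t;

/* Route each MAU request in a row to one of the four datapaths. The
 * datapath index is the lowest 2 bits of the column address, so
 * requests in consecutive columns map to distinct paths, and each
 * path needs only a quarter of the tag register file and SRAM. */
static void dispatch_row(const mau_req_t *reqs, int n,
                         mau_req_t out[4], int valid[4])
{
    for (int p = 0; p < 4; p++)
        valid[p] = 0;
    for (int i = 0; i < n; i++) {       /* n <= 4 per cycle */
        int path = reqs[i].col & 0x3;   /* lowest 2 bits of column */
        out[path]   = reqs[i];          /* unique path by construction */
        valid[path] = 1;
    }
}

Two MAUs in the same column would map to the same path and could not be issued in the same cycle, which is exactly the addressing flexibility the design gives up.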
The cache tags for missed cache lines are updated immediately when the lines are requested from DRAM. This preemptive update ensures that future reads to the same cache line do not result in multiple requests to the DRAM. Note that this behavior is similar to that of a simple non-blocking cache and does not involve any speculation. Additionally, since the MC cache is a read-only cache, there is no need for write-back on eviction.
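The following C sketch illustrates the preemptive tag update for a direct-mapped, read-only cache; the geometry and the issue_dram_read hook are hypothetical stand-ins for the actual MAU sizing and DRAM interface.

#include <stdint.h>

#define LINE_BYTES 32   /* hypothetical MAU size        */
#define NUM_SETS   128  /* hypothetical number of sets  */

typedef struct {
    uint32_t tag;
    int      valid;
    int      pending;   /* fill requested, data not yet in SRAM */
} line_state_t;

static line_state_t tags[NUM_SETS];

/* Hypothetical DRAM hook: enqueue a read for one cache line. */
static void issue_dram_read(uint32_t addr) { (void)addr; }

/* The tag is claimed as soon as the miss is detected, before the data
 * returns. A later access to the same line then hits the updated tag
 * and waits for the pending fill instead of issuing a duplicate DRAM
 * request. Being read-only, eviction never needs a write-back. */
static void lookup(uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t set  = line % NUM_SETS;
    uint32_t tag  = line / NUM_SETS;
    line_state_t *l = &tags[set];

    if (l->valid && l->tag == tag)
        return;                 /* hit, possibly on a pending line */

    l->tag     = tag;           /* preemptive tag update */
    l->valid   = 1;
    l->pending = 1;             /* cleared when the DRAM data lands */
    issue_dram_read(addr);
}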
10.6.2.2 Queue Management and Hazard Control
Each datapath has independent read and write queues, which help absorb the variable DRAM latency. The 32-deep read queue stores pending requests to the SRAM. The eight-deep write queue stores pending cache misses that are yet to be resolved by the DRAM. The write queue is shorter because fewer cache misses are expected. Thus, with four eight-deep write queues, the cache allows for up to 32 pending requests to the DRAM. At the system level, the latency of fetching data from the DRAM is hidden by allowing for a separate motion vector (MV) dispatch stage in the pipeline prior to the Prediction stage.
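The queue sizing can be summarized with a small C sketch of the admission check. Only the depths (32 and 8, per datapath) come from the design above; the struct layout is illustrative.

#include <stdint.h>

#define READ_Q_DEPTH  32   /* pending SRAM reads              */
#define WRITE_Q_DEPTH  8   /* unresolved misses awaiting DRAM */

typedef struct { uint32_t buf[READ_Q_DEPTH];  int count; } read_q_t;
typedef struct { uint32_t buf[WRITE_Q_DEPTH]; int count; } write_q_t;

/* A request enters a datapath only if its read queue has room and,
 * for a miss, its write queue has room as well; otherwise MV dispatch
 * stalls. This back-pressure is what lets the cache ride out the
 * variable DRAM latency without dropping requests. */
static int can_accept(const read_q_t *rq, const write_q_t *wq, int is_miss)
{
    if (rq->count >= READ_Q_DEPTH)
        return 0;
    if (is_miss && wq->count >= WRITE_Q_DEPTH)
        return 0;
    return 1;
}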