Alternative techniques to tackle conflict misses include having separate luma and chroma caches. Similarly, offsetting the memory map so that the same location in successive frames maps to different cache lines can also reduce conflicts. For our chosen configuration, the added complexity of these techniques outweighed the observed hit-rate increases.
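As an illustration of the second idea, the following C sketch shows one way a per-frame offset could be folded into the set-index calculation. The cache geometry (32-byte lines, 128 sets) and the half-set-count offset are hypothetical choices for the sketch, not the configuration evaluated here.

#include <stdint.h>

#define LINE_BYTES 32   /* hypothetical MAU/line size   */
#define NUM_SETS   128  /* hypothetical number of sets  */

/* Map a physical address to a cache set. Adding a per-frame offset
 * shifts the mapping each frame, so the same reference location in
 * successive frames lands in a different set and conflicts less with
 * co-located data still resident from the previous frame. */
static uint32_t set_index(uint32_t addr, uint32_t frame_idx)
{
    uint32_t line = addr / LINE_BYTES;
    return (line + frame_idx * (NUM_SETS / 2)) % NUM_SETS;
}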
10.6.2 Four-Parallel Cache Architecture
This section describes a four-parallel MC cache architecture. Datapath parallelism, together with outstanding-request queues that hide the variable DRAM latency, ensures high throughput. As seen in Fig. 10.17, there are four parallel paths, each outputting up to 32 pixels (1 MAU) per cycle.
10.6.2.1 Four-Parallel Data Flow
The parallelism in the cache datapath allows up to 4 MAUs in a row to be processed simultaneously. The MC cache must fetch at most a 23 × 23 reference region, corresponding to a 16 × 16 PU, which is the largest PU processed by Inter Prediction (see Sect. 10.5.1). This may require up to seven cycles, as shown in Fig. 10.16. The address translation unit in Fig. 10.17 reorders the MAUs based on the lowest 2 bits of the column address. This maps each request to a unique datapath and allows us to split the tag register file and cache SRAM into four smaller pieces. Note that this design cannot output 2 MAUs in the same column on the same cycle. Thus, our design trades unused flexibility in addressing for smaller tag-register and SRAM sizes.
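A minimal C sketch of this column-based routing follows, assuming MAU requests are addressed by (row, column) and that the up-to-four MAUs issued per cycle occupy consecutive columns; the type and function names are illustrative only.

#include <stdint.h>

typedef struct { uint32_t row, col; } mau_req_t;

/* Route each MAU request in a row to one of the four datapaths. The
 * datapath index is the lowest 2 bits of the column address, so
 * requests in consecutive columns map to distinct paths, and each
 * path needs only a quarter of the tag register file and SRAM. */
static void dispatch_row(const mau_req_t *reqs, int n,
                         mau_req_t out[4], int valid[4])
{
    for (int p = 0; p < 4; p++)
        valid[p] = 0;
    for (int i = 0; i < n; i++) {       /* n <= 4 per cycle */
        int path = reqs[i].col & 0x3;   /* lowest 2 bits of column */
        out[path]   = reqs[i];          /* unique path by construction */
        valid[path] = 1;
    }
}

Two MAUs in the same column would map to the same path and could not be issued in the same cycle, which is exactly the addressing flexibility the design gives up.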
The cache tags for missed cache lines are updated immediately when the lines are requested from DRAM. This preemptive update ensures that future reads to the same cache line do not result in multiple requests to the DRAM. Note that this behavior is similar to that of a simple non-blocking cache and does not involve any speculation. Additionally, since the MC cache is a read-only cache, there is no need for write-back on eviction.
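The following C sketch illustrates the preemptive tag update for a direct-mapped, read-only cache; the geometry and the issue_dram_read hook are hypothetical stand-ins for the actual MAU sizing and DRAM interface.

#include <stdint.h>

#define LINE_BYTES 32   /* hypothetical MAU size        */
#define NUM_SETS   128  /* hypothetical number of sets  */

typedef struct {
    uint32_t tag;
    int      valid;
    int      pending;   /* fill requested, data not yet in SRAM */
} line_state_t;

static line_state_t tags[NUM_SETS];

/* Hypothetical DRAM hook: enqueue a read for one cache line. */
static void issue_dram_read(uint32_t addr) { (void)addr; }

/* The tag is claimed as soon as the miss is detected, before the data
 * returns. A later access to the same line then hits the updated tag
 * and waits for the pending fill instead of issuing a duplicate DRAM
 * request. Being read-only, eviction never needs a write-back. */
static void lookup(uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t set  = line % NUM_SETS;
    uint32_t tag  = line / NUM_SETS;
    line_state_t *l = &tags[set];

    if (l->valid && l->tag == tag)
        return;                 /* hit, possibly on a pending line */

    l->tag     = tag;           /* preemptive tag update */
    l->valid   = 1;
    l->pending = 1;             /* cleared when the DRAM data lands */
    issue_dram_read(addr);
}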
10.6.2.2 Queue Management and Hazard Control
Each datapath has independent read and write queues, which help absorb the variable DRAM latency. The 32-deep read queue stores pending requests to the SRAM. The eight-deep write queue stores pending cache misses that are yet to be resolved by the DRAM. The write queue is shorter because fewer cache misses are expected. Thus, with four eight-deep write queues, the cache allows for up to 32 pending requests to the DRAM. At the system level, the latency of fetching data from the DRAM is hidden by allowing for a separate motion vector (MV) dispatch stage in the pipeline prior to the Prediction stage.
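The queue sizing can be summarized with a small C sketch of the admission check. Only the depths (32 and 8, per datapath) come from the design above; the struct layout is illustrative.

#include <stdint.h>

#define READ_Q_DEPTH  32   /* pending SRAM reads              */
#define WRITE_Q_DEPTH  8   /* unresolved misses awaiting DRAM */

typedef struct { uint32_t buf[READ_Q_DEPTH];  int count; } read_q_t;
typedef struct { uint32_t buf[WRITE_Q_DEPTH]; int count; } write_q_t;

/* A request enters a datapath only if its read queue has room and,
 * for a miss, its write queue has room as well; otherwise MV dispatch
 * stalls. This back-pressure is what lets the cache ride out the
 * variable DRAM latency without dropping requests. */
static int can_accept(const read_q_t *rq, const write_q_t *wq, int is_miss)
{
    if (rq->count >= READ_Q_DEPTH)
        return 0;
    if (is_miss && wq->count >= WRITE_Q_DEPTH)
        return 0;
    return 1;
}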