one macroblock can contain only one kind of intra block size, which can be used to
design optimized pipeline schedules as in [7, 24]. Since a CTU in HEVC can have a
variety of TUs and a mix of intra and inter CUs, such pipeline schedules will be too
complex to optimize for every possible combination.
As a result, designing a data-flow that respects across-TU dependencies and
provides high throughput is a bigger challenge than the pixel computation involved
in reference preparation and prediction. In this chapter, we focus on the data-flow
management used in [8], which uses a hierarchical memory deployment for high
throughput and low area. The intra engine operates on blocks of 32×32 luma pixels
and two 16×16 chroma blocks, since those are the largest TU sizes. In the complete
decoder pipeline, it communicates with the entropy decoder and inverse transform at a
Variable-sized Pipeline Block (VPB) granularity. (The mapping between VPB and
CTU is shown in Table 10.1. For a 16×16 CTU, four CTUs are combined into one
intra pipeline block.)
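The grouping of CTUs into pipeline blocks can be summarized in a short sketch. It assumes, consistent with the four-CTUs-per-block case above, that a VPB always spans 64 luma pixels horizontally; the constant and function names are illustrative, not from Table 10.1 itself.

```python
# Hypothetical sketch of the VPB/CTU grouping: assume a VPB always spans
# 64 luma pixels in width, so CTUs smaller than 64x64 are grouped
# (consistent with "four 16x16 CTUs per intra pipeline block").
VPB_WIDTH = 64

def ctus_per_vpb(ctu_size):
    """Number of CTUs combined into one intra pipeline block (assumed mapping)."""
    assert ctu_size in (16, 32, 64)
    return VPB_WIDTH // ctu_size
```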
10.7.1 Hierarchical Memory Deployment
The bottom row pixels of all VPBs in a row of VPBs need to be stored since
they are top neighbors for VPBs in the row below. This buffer must be sized
proportional to the picture width and may be implemented in on-chip SRAM or
external DRAM. Storing VPB-level neighboring pixels in registers, as previous
designs for H.264/AVC have done, can provide the required high-throughput access,
but requires a lot of area since the VPB can be as large as 64×64. This issue is
addressed by storing the neighboring pixels in SRAM at the VPB level to save area,
and in registers at the TU level for high throughput. A memory hierarchy is thus
formed:
1. VPB-row-level top neighbors in SRAM or external memory
2. VPB-level neighboring pixels in SRAM
3. TU-level reference pixels in registers
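The three levels above can be modeled in a minimal sketch. The class and field names are illustrative; the only figures beyond the text are the standard HEVC reference-pixel count for an N×N TU (2N top/top-right + 2N left/below-left + 1 corner = 4N+1) and a sample picture width.

```python
# Illustrative model of the three-level reference-pixel memory hierarchy.
class IntraReferenceHierarchy:
    def __init__(self, picture_width):
        # 1. VPB-row top neighbors: one bottom row per VPB row,
        #    sized by the picture width (SRAM or external memory).
        self.vpb_row_top = [0] * picture_width
        # 2. VPB-level neighbors: a ping-pong pair of SRAMs,
        #    192 pixels each (sizing described later in the text).
        self.vpb_top = ([0] * 192, [0] * 192)
        # 3. TU-level references in registers: an NxN TU needs
        #    4N + 1 neighboring pixels; the largest TU is 32x32.
        n = 32
        self.tu_refs = [0] * (4 * n + 1)
```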
The hierarchical memory deployment is shown in Fig. 10.20 and the memory
elements are explained next:
1. VPB-Row top neighbors: In [9], this buffer is implemented in an on-chip SRAM
that is shared with the deblocking filter. The deblocking filter stores four top rows,
of which intra prediction uses one.
2. VPB top neighbors: This buffer is implemented using a pair of SRAMs in a
ping-pong fashion. One SRAM is used in the intra-prediction of the current VPB. It
is updated every TU with neighboring pixels for the next TU. At the same time,
the other SRAM updates the VPB-Row top SRAM with pixels from the previous
VPB and loads top row pixels for the next VPB. The size of each SRAM is 192
pixels (64 Y top + 32 Y top-right + 64 UV top + 32 UV top-right).
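The per-VPB ping-pong schedule and the 192-pixel sizing can be sketched as follows; the class and method names are hypothetical, but the roles of the two SRAMs follow the description above.

```python
# Each VPB-top SRAM holds 64 Y top + 32 Y top-right + 64 UV top
# + 32 UV top-right pixels.
VPB_TOP_PIXELS = 64 + 32 + 64 + 32  # = 192

class VpbTopPingPong:
    """Illustrative ping-pong control for the two VPB-level SRAMs."""
    def __init__(self):
        self.srams = [[0] * VPB_TOP_PIXELS, [0] * VPB_TOP_PIXELS]
        self.active = 0  # serves intra prediction of the current VPB

    def background(self):
        # The other SRAM writes the previous VPB's pixels back to the
        # VPB-row SRAM and prefetches top pixels for the next VPB.
        return self.srams[1 - self.active]

    def next_vpb(self):
        # Swap roles when advancing to the next VPB.
        self.active = 1 - self.active
```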