Decoder Hardware Architecture for HEVC - High Efficiency Video Coding (HEVC)

In [9], VPB-based pipelining is used between the deblocking filter and prediction
stages. This allows operations within the deblocking filter to be scheduled
independently of the coding tree structure. A smaller granularity can also be used to
save pipeline buffer SRAM at the cost of scheduling complexity. Since the in-loop
filtering process for the current block of pixels depends on blocks to the right and
bottom that have not yet been reconstructed, the block cannot be filtered in its
entirety. Instead, the output of the deblocking filter is shifted from the input by four
luma pixels and two chroma pixels to the left and the top, and the output of SAO is
shifted by another pixel for all color components in both directions.
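The output shifts described above can be sketched as a small coordinate calculation. This is only an illustration: the function and constant names are hypothetical, coordinates are given in the sample grid of the chosen component, and clipping at picture boundaries is not handled.

```python
# Illustrative sketch: where each in-loop filter stage writes its output
# relative to an input block. Names are hypothetical, not from the source.

DEBLOCK_SHIFT = {"luma": 4, "chroma": 2}  # pixels, to the left and to the top
SAO_EXTRA_SHIFT = 1                       # one more pixel for every component


def shifted_region(x0, y0, size, shift):
    """Return (x, y, width, height) of a block moved up and left by `shift`."""
    return (x0 - shift, y0 - shift, size, size)


def stage_outputs(x0, y0, size, component="luma"):
    """Return the deblocking output region and the SAO output region."""
    deblock_shift = DEBLOCK_SHIFT[component]
    deblocked = shifted_region(x0, y0, size, deblock_shift)
    sao = shifted_region(x0, y0, size, deblock_shift + SAO_EXTRA_SHIFT)
    return deblocked, sao


luma_regions = stage_outputs(64, 64, 64, "luma")
chroma_regions = stage_outputs(32, 32, 32, "chroma")
```

For a 64×64 luma input block at (64, 64), the deblocked output covers the 64×64 region starting at (60, 60) and the SAO output starts at (59, 59); the corresponding 32×32 chroma block at (32, 32) yields regions starting at (30, 30) and (29, 29).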

10.8.1 Deblocking Filter

Compared to H.264/AVC, HEVC's deblocking filter has several simplifications
related to processing dependencies. The luma deblocking filter operates on edges
lying on an 8×8 grid; the filter takes 4 pixels on either side of the edge as input
and writes up to 3 pixels on either side. As a result, unlike H.264/AVC, filters on
adjacent edges are completely decoupled and it is possible to filter 8×8 pixel blocks
independently. The key challenge in the deblocking filter architecture is designing
an efficient data flow to handle cross-CTU dependencies.
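The decoupling argument can be checked with a one-dimensional sketch: an edge's filter is independent of its neighbours if the pixels it may modify are never read by the filter of any other edge. The code below assumes the tap counts stated above (4 pixels read and up to 3 written on each side); the function names are hypothetical, and the 4-pixel grid is included only as a contrast case.

```python
# Illustrative 1-D check of edge decoupling. Names are hypothetical.

READ_TAPS = 4   # pixels read on each side of an edge
WRITE_TAPS = 3  # maximum pixels modified on each side of an edge


def read_window(edge):
    """Pixel positions read when filtering the edge between edge-1 and edge."""
    return set(range(edge - READ_TAPS, edge + READ_TAPS))


def write_window(edge):
    """Pixel positions that may be modified when filtering this edge."""
    return set(range(edge - WRITE_TAPS, edge + WRITE_TAPS))


def edges_are_decoupled(width, grid):
    """True if no edge writes a pixel that another edge's filter reads."""
    edges = list(range(grid, width, grid))
    for a in edges:
        for b in edges:
            if a != b and write_window(a) & read_window(b):
                return False
    return True


hevc_like = edges_are_decoupled(64, grid=8)  # 8-pixel edge spacing
dense_grid = edges_are_decoupled(64, grid=4)  # contrast: 4-pixel spacing
```

With an 8-pixel grid the read windows of neighbouring edges (for example pixels 4 to 11 and 12 to 19) do not overlap at all, so `hevc_like` is `True`; with a 4-pixel grid and the same taps the windows overlap and `dense_grid` is `False`.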

The bottom four rows and right-most four columns of luma pixels (and two
rows and columns of chroma pixels) in a CTU depend on the CTUs to the bottom,
right and bottom-right for their deblocking. Accordingly, their processing must be
delayed until those CTUs are available, and they must be stored temporarily until
then. Along with the pixels, the parameters required for computing the boundary
strength (prediction mode, motion vectors, TU and PU boundaries, and quantization
parameter) also need temporary storage. The right-most four columns
need a 1-CTU-high buffer (called the Last CTU buffer) while the bottom four rows
need a 1-picture-wide buffer (called the Line buffer).

The boundary strength parameters are available at a worst-case granularity of
4×4 pixels and take about 78 bits each: 64 bits for two motion vectors, 4 bits for two
reference list indices, 6 bits for the quantization parameter, 2 bits for the prediction
mode (intra-prediction, uni-prediction, or bi-prediction), and one bit each for the TU
boundary and PU boundary flags. For example, for a 4K Ultra-HD (3,840 × 2,160) picture
and a 64×64 CTU, the Last CTU buffer must hold 64 × 4 luma pixels, 2 × 32 × 2 chroma
pixels and 16 boundary strength parameters, resulting in a total of 4,320 bits. The
Line buffer must hold 3,840 × 4 luma pixels, 2 × 1,920 × 2 chroma pixels and 960
boundary strength parameters, resulting in a total of about 259 kbit (259,200 bits
at the same 8 bits per pixel). While the Last CTU buffer can
be stored in registers or SRAM, it might be necessary to store the Line buffer in
external DRAM depending on area constraints. However, due to the regular access
pattern on the Line buffer, it is possible to prefetch the data and hide the DRAM
latency (at the cost of on-chip memory for request and response queues to and
from the DRAM).
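The buffer sizes can be checked with a short calculation. The sketch below assumes 8-bit samples and 4:2:0 chroma subsampling; neither is stated explicitly in the text, but together they reproduce the 4,320-bit Last CTU figure. All names are hypothetical.

```python
# Illustrative buffer-size calculation. Assumptions: 8-bit samples and
# 4:2:0 chroma. Field names are hypothetical labels for the listed bits.

BITS_PER_PIXEL = 8  # assumption, not stated in the text

BS_PARAM_FIELDS = {
    "motion_vectors": 64,         # two motion vectors
    "reference_indices": 4,       # two reference list indices
    "quantization_parameter": 6,
    "prediction_mode": 2,         # intra, uni-prediction or bi-prediction
    "tu_boundary": 1,
    "pu_boundary": 1,
}
BS_PARAM_BITS = sum(BS_PARAM_FIELDS.values())


def edge_buffer_bits(length):
    """Bits needed for a 4-luma-pixel-deep strip that is `length` pixels long.

    Includes the two 2-pixel-deep chroma strips at half length and one
    boundary strength parameter set per 4 luma pixels along the strip.
    """
    luma_bits = length * 4 * BITS_PER_PIXEL
    chroma_bits = 2 * (length // 2) * 2 * BITS_PER_PIXEL
    param_bits = (length // 4) * BS_PARAM_BITS
    return luma_bits + chroma_bits + param_bits


last_ctu_buffer_bits = edge_buffer_bits(64)    # one 64x64 CTU high
line_buffer_bits = edge_buffer_bits(3840)      # one 4K picture wide
```

Under these assumptions `BS_PARAM_BITS` is 78, `edge_buffer_bits(64)` returns 4,320 and `edge_buffer_bits(3840)` returns 259,200.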
