In [9], VPB-based pipelining is used between the deblocking filter and prediction
stages. This allows processing within the deblocking filter to be scheduled
independently of the coding tree structure. A smaller granularity can also be used to
save pipeline buffer SRAM at the cost of scheduling complexity. Since the in-loop
filtering process for the current block of pixels depends on blocks to the right and
bottom which have not yet been reconstructed, the block cannot be filtered
completely when it arrives. The output of the deblocking filter is therefore shifted
from the input by four luma pixels and two chroma pixels to the left and the top,
and the output of SAO is shifted by another pixel for all color components in both
directions.
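
The shifted output windows described above can be sketched in a few lines of Python. This is only an illustrative model, not the design from [9]: the names `output_window` and `stage_windows` are hypothetical, coordinates are in samples of the chosen component, and handling of the right and bottom picture edges (where no further neighbor exists) is deliberately left out.

```python
# Shifts stated in the text: the deblocking output lags its input by
# 4 luma / 2 chroma pixels, and SAO lags deblocking by 1 more pixel.
DEBLOCK_SHIFT = {"luma": 4, "chroma": 2}
SAO_EXTRA_SHIFT = 1


def output_window(x0, y0, size, shift):
    """Return (x_start, y_start, x_end, y_end) of the output window.

    The window is the input block moved `shift` pixels left and up.
    Its start is clipped at the picture's top-left corner; the end
    coordinates are exclusive.
    """
    x_start = max(x0 - shift, 0)
    y_start = max(y0 - shift, 0)
    return (x_start, y_start, x0 + size - shift, y0 + size - shift)


def stage_windows(x0, y0, size, component="luma"):
    """Output windows of the deblocking and SAO stages for one input block."""
    deblock_shift = DEBLOCK_SHIFT[component]
    sao_shift = deblock_shift + SAO_EXTRA_SHIFT
    return {
        "deblock": output_window(x0, y0, size, deblock_shift),
        "sao": output_window(x0, y0, size, sao_shift),
    }


if __name__ == "__main__":
    # A 64x64 luma block whose top-left corner is at (64, 64).
    print(stage_windows(64, 64, 64, "luma"))
```

For the block at (64, 64), the deblocking stage emits the luma window from (60, 60) up to but excluding (124, 124), and SAO emits the window from (59, 59) up to but excluding (123, 123).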
10.8.1 Deblocking Filter
Compared to H.264/AVC, HEVC's deblocking filter has several simplifications
related to processing dependencies. The luma deblocking filter operates on edges
lying on an 8×8 grid; the filter takes 4 pixels on either side of the edge as input
and writes up to 3 pixels on either side. As a result, unlike H.264/AVC, filters on
adjacent edges are completely decoupled and it is possible to filter 8×8 pixel blocks
independently. The key challenge in the deblocking filter architecture is designing
an efficient data flow to handle cross-CTU dependencies.
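
The decoupling argument can be checked with a small one-dimensional model. The sketch below is an illustration only: `edge_ranges` and `edges_decoupled` are hypothetical helper names, and the 4-pixel grid case simply reuses the same tap counts to show what a coupled arrangement looks like, rather than modeling H.264/AVC in detail.

```python
def edge_ranges(edge, read_taps=4, write_taps=3):
    """Pixel positions read and written by the filter for one edge.

    `edge` is the index of the first pixel after the edge, so the filter
    reads `read_taps` pixels on each side and writes at most `write_taps`.
    """
    read = range(edge - read_taps, edge + read_taps)
    write = range(edge - write_taps, edge + write_taps)
    return read, write


def edges_decoupled(grid, length=64, read_taps=4, write_taps=3):
    """True if no edge writes a pixel that a different edge reads."""
    edges = list(range(grid, length, grid))
    for a in edges:
        _, write_a = edge_ranges(a, read_taps, write_taps)
        for b in edges:
            if a == b:
                continue
            read_b, _ = edge_ranges(b, read_taps, write_taps)
            if set(write_a) & set(read_b):
                return False
    return True


if __name__ == "__main__":
    print("8-pixel grid decoupled:", edges_decoupled(8))
    print("4-pixel grid decoupled:", edges_decoupled(4))
```

With edges every 8 pixels, the 4-pixel read windows of neighboring edges tile the row without overlapping, so the check passes. With edges every 4 pixels, the pixels written for one edge fall inside the read window of the next, so the filters would have to be ordered.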
The bottom four rows and right-most four columns of luma pixels (and two
rows and columns of chroma pixels) in a CTU depend on the CTUs to the bottom,
right and bottom-right for their deblocking. Accordingly, their processing must be
delayed until those CTUs are available and they must be temporarily stored until
then. Along with the pixels, parameters such as prediction mode, motion vectors, TU
and PU boundaries, and quantization parameter, which are required for computing
the boundary strength, also need temporary storage. The right-most four columns
need a buffer one CTU high (called the Last CTU buffer), while the bottom four rows
need a buffer one picture wide (called the Line buffer).
The boundary strength parameters are available at a worst-case granularity of
4×4 pixels and take about 78 bits: 64 bits for two motion vectors, 4 bits for two
reference list indices, 6 bits for the quantization parameter, 2 bits for the prediction
mode (intra-prediction, uni-prediction, or bi-prediction), and one bit each for the TU
boundary and PU boundary flags. For example, for a 4K Ultra-HD (3,840×2,160) picture
and 64×64 CTUs, the Last CTU buffer must hold 64×4 luma pixels, 2×32×2 chroma pixels and
16 boundary strength parameters, resulting in a total of 4,320 bits. The Line buffer
must hold 3,840×4 luma pixels, 2×1,920×2 chroma pixels and 960 boundary
strength parameters, resulting in a total of 96 kbit. While the Last CTU buffer can
be stored in registers or SRAM, it might be necessary to store the Line buffer in
external DRAM depending on area constraints. However, due to the regular access
pattern on the Line buffer, it is possible to prefetch the data and hide the DRAM
latency (at the cost of on-chip memory for request and response queues to and
from the DRAM).
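
The buffer sizes quoted above can be tallied with a short calculation. The sketch below assumes 8-bit samples and 4:2:0 chroma. Neither is stated in the text, but together they reproduce the 4,320-bit Last CTU figure. The helper name `edge_buffer_bits` is hypothetical.

```python
# Bit widths of one boundary strength parameter set, as listed in the text.
BS_PARAM_FIELDS = {
    "motion_vectors": 64,         # two motion vectors
    "reference_indices": 4,       # two reference list indices
    "quantization_parameter": 6,
    "prediction_mode": 2,         # intra, uni-prediction or bi-prediction
    "tu_boundary": 1,
    "pu_boundary": 1,
}
BS_PARAM_BITS = sum(BS_PARAM_FIELDS.values())

BITS_PER_SAMPLE = 8  # assumption: 8-bit video, not stated in the text


def edge_buffer_bits(length, luma_depth=4, chroma_depth=2, bs_granularity=4):
    """Bits needed to hold one deferred strip of `length` luma pixels.

    The strip is `luma_depth` luma pixels deep. Each of the two chroma
    planes (assumed 4:2:0) is half as long and `chroma_depth` pixels deep.
    One boundary strength parameter set is kept per `bs_granularity` pixels.
    """
    luma_pixels = length * luma_depth
    chroma_pixels = 2 * (length // 2) * chroma_depth
    bs_params = length // bs_granularity
    pixel_bits = (luma_pixels + chroma_pixels) * BITS_PER_SAMPLE
    return pixel_bits + bs_params * BS_PARAM_BITS


if __name__ == "__main__":
    print("bits per parameter set:", BS_PARAM_BITS)
    print("Last CTU buffer bits:", edge_buffer_bits(64))
    print("Line buffer bits:", edge_buffer_bits(3840))
```

Under these assumptions the fields sum to 78 bits and the Last CTU buffer comes to 4,320 bits, matching the text. The same tally for the Line buffer terms gives 259,200 bits, which is larger than the 96 kbit quoted, so that figure presumably rests on a storage format or bit depth that the text does not spell out.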