• The inverse transform block is considerably more complicated due to the large TU sizes and higher precision of the transform matrix. The largest TU size (32×32) requires a 16× larger transpose memory.
• HEVC uses an 8-tap luma interpolation filter for motion compensation, compared to the 6-tap filter in H.264/AVC. This increases the bandwidth required from the decoded picture buffer; a back-of-the-envelope sizing sketch for both of these points follows this list.
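The cost of both points can be estimated with simple arithmetic. The C sketch below compares the transpose-memory footprint of a 32×32 TU against the 8×8 maximum transform size of H.264/AVC, and the reference samples fetched per N×N luma block for an 8-tap versus a 6-tap separable interpolation filter. The 16-bit intermediate precision and the square block sizes are illustrative assumptions, not figures taken from the text.

```c
#include <stdio.h>

/* Back-of-the-envelope sizing for the two points above.
 * Assumptions (not from the text): 16-bit transpose-memory word width
 * and square NxN luma prediction blocks. */
int main(void) {
    /* Transpose memory: 32x32 TU versus the 8x8 maximum transform
     * size of H.264/AVC, at the same word width. */
    int hevc_bits = 32 * 32 * 16;
    int avc_bits  = 8 * 8 * 16;
    printf("transpose memory ratio: %dx\n", hevc_bits / avc_bits); /* 16x */

    /* Reference samples fetched per NxN luma block: an L-tap separable
     * filter needs (N + L - 1)^2 samples for a fractional-pel position. */
    for (int n = 8; n <= 64; n *= 2) {
        double f8 = (double)(n + 7) * (n + 7);  /* 8-tap (HEVC)      */
        double f6 = (double)(n + 5) * (n + 5);  /* 6-tap (H.264/AVC) */
        printf("N=%2d: fetch overhead %3.0f%% (8-tap) vs %3.0f%% (6-tap)\n",
               n, 100.0 * (f8 / (n * n) - 1.0),
               100.0 * (f6 / (n * n) - 1.0));
    }
    return 0;
}
```

The growing gap between the two filters at small block sizes is what motivates the high-throughput MC cache mentioned below.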
The architecture of the video decoder depends strongly on parameters such as the required throughput (i.e., the pixel rate defined by the level limit in the HEVC specification), the technology node, the area and power budgets, the control and data interfaces to the external world, and the memory technology used for the decoded picture buffer. In this chapter, we describe the architecture of an HEVC decoder for 4K Ultra HD decoding at 30 fps, designed in 40 nm CMOS technology with external DDR3 memory for the decoded picture buffer. The decoder operates at 200 MHz and is frequency-scalable for lower resolutions and picture rates; a rough cycle-budget estimate for this operating point is sketched after the list below. Along with techniques used in H.264/AVC decoders, such as frame-level parallelism [29] and reference frame compression [20], and general VLSI techniques such as pipelining and dynamic voltage and frequency scaling, HEVC decoders can benefit from architectural techniques such as:
• Variable-size pipelining to reduce on-chip SRAM and handle different CTU sizes.
• Unified processing engines for prediction and transform to manage the large diversity of PU and TU sizes.
• High-throughput motion compensation (MC) cache to address increased DRAM requirements for the longer interpolation filters.
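As a concrete illustration of the throughput requirement, the sketch below estimates the luma pixel rate and the per-CTU cycle budget at the stated operating point (4K Ultra HD at 30 fps, 200 MHz). The 3840×2160 frame size and 64×64 CTU size are assumptions used only for this estimate.

```c
#include <stdio.h>

/* Rough cycle-budget estimate for 4K UHD at 30 fps with a 200 MHz clock.
 * Assumed for illustration: 3840x2160 frames and 64x64 CTUs. */
int main(void) {
    const long width = 3840, height = 2160, fps = 30;
    const long clock_hz = 200000000L;

    long luma_rate = width * height * fps;              /* ~249 Mpixel/s */
    double pixels_per_cycle = (double)luma_rate / clock_hz;

    long ctus_per_frame = ((width + 63) / 64) * ((height + 63) / 64);
    double cycles_per_ctu = (double)clock_hz / (ctus_per_frame * fps);

    printf("luma pixel rate      : %ld pixels/s\n", luma_rate);
    printf("luma pixels per cycle: %.2f\n", pixels_per_cycle);
    printf("cycle budget per CTU : %.0f cycles\n", cycles_per_ctu);
    return 0;
}
```

At roughly 1.2 luma pixels per cycle, every 64×64 CTU must leave the pipeline within a few thousand cycles on average, which is the budget the techniques listed above have to meet.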
10.2 System Pipeline
The granularity of the top-level pipeline is affected by processing dependencies
between pixels. For example, computing the luma residue at any pixel location
requires all transform coefficients in the TU that contains the pixel. Hence, it is
not possible for the inverse transform block to use, say, a 4×4 pixel pipeline; the
pipeline granularity must be at least one TU in size. In general, it is desirable to
minimize the pipeline granularity to reduce processing latency and memory sizes.
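The TU-level dependency can be seen directly from a brute-force formulation of the separable 2D inverse transform: every residue sample is a weighted sum over all coefficients in the TU. The sketch below uses a placeholder basis matrix T rather than the actual HEVC core transform, and the output scaling is arbitrary.

```c
#include <stdint.h>

#define N 32                    /* largest TU size */

static int32_t T[N][N];         /* placeholder transform basis, not the
                                   HEVC core transform matrix */

/* Brute-force 2D inverse transform of one NxN TU. */
void inverse_transform_tu(const int16_t coeff[N][N], int16_t residue[N][N])
{
    for (int y = 0; y < N; y++) {
        for (int x = 0; x < N; x++) {
            int64_t acc = 0;
            /* Every residue sample depends on ALL N*N coefficients, so no
             * sample can be produced before the complete TU has arrived at
             * this pipeline stage. */
            for (int u = 0; u < N; u++)
                for (int v = 0; v < N; v++)
                    acc += (int64_t)T[u][y] * T[v][x] * coeff[u][v];
            residue[y][x] = (int16_t)(acc >> 14);  /* illustrative scaling */
        }
    }
}
```

A hardware implementation would of course use the separable row/column decomposition with a transpose memory in between, but the data dependency is the same: the stage consumes whole TUs, not fixed 4×4 blocks.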
The largest CTU needs 6 kB to store its luma and chroma pixels at 8-bit precision. The transform coefficients and residue are computed at higher precision (16-bit and 9-bit, respectively) and require correspondingly larger storage. Other information, such as intra-prediction modes and inter-prediction motion vectors, needs to be stored at a 4×4 granularity. All of these require large pipeline buffers in SRAM, and several techniques described in this chapter can be used to reduce their size.
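A rough sizing of these pipeline buffers, assuming a 64×64 CTU with 4:2:0 chroma (consistent with the 6 kB figure above), is sketched below; the tight packing of 9-bit residue samples is an assumption for illustration.

```c
#include <stdio.h>

/* Pipeline-buffer sizing for the largest CTU.
 * Assumed for illustration: 64x64 CTU, 4:2:0 chroma. */
int main(void) {
    const int luma    = 64 * 64;           /* 4096 luma samples   */
    const int chroma  = 2 * 32 * 32;       /* 2048 chroma samples */
    const int samples = luma + chroma;     /* 6144 samples        */

    printf("8-bit pixels       : %d bytes (~6 kB)\n",  samples * 8  / 8);
    printf("16-bit coefficients: %d bytes (~12 kB)\n", samples * 16 / 8);
    printf("9-bit residue      : %d bytes (if tightly packed)\n",
           samples * 9 / 8);

    /* Side information (intra mode, motion vectors, ...) is stored per
     * 4x4 block. */
    printf("4x4 blocks per CTU : %d\n", (64 / 4) * (64 / 4));   /* 256 */
    return 0;
}
```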
Line buffers are required to handle data dependencies between CTUs in the
vertical direction. For example, the deblocking filter needs to store four rows of