• The inverse transform block is considerably more complicated due to the large TU sizes and higher precision of the transform matrix. The largest TU size (32×32) requires a 16× larger transpose memory.
• HEVC uses an 8-tap luma interpolation filter for motion compensation, compared to the 6-tap filter in H.264/AVC. This increases the bandwidth required from the decoded picture buffer; a back-of-the-envelope sizing sketch for both of these points follows this list.
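The cost of both points can be estimated with simple arithmetic. The C sketch below compares the transpose-memory footprint of a 32×32 TU against the 8×8 maximum transform size of H.264/AVC, and the reference samples fetched per N×N luma block for an 8-tap versus a 6-tap separable interpolation filter. The 16-bit intermediate precision and the square block sizes are illustrative assumptions, not figures taken from the text.

```c
#include <stdio.h>

/* Back-of-the-envelope sizing for the two points above.
 * Assumptions (not from the text): 16-bit transpose-memory word width
 * and square NxN luma prediction blocks. */
int main(void) {
    /* Transpose memory: 32x32 TU versus the 8x8 maximum transform
     * size of H.264/AVC, at the same word width. */
    int hevc_bits = 32 * 32 * 16;
    int avc_bits  = 8 * 8 * 16;
    printf("transpose memory ratio: %dx\n", hevc_bits / avc_bits); /* 16x */

    /* Reference samples fetched per NxN luma block: an L-tap separable
     * filter needs (N + L - 1)^2 samples for a fractional-pel position. */
    for (int n = 8; n <= 64; n *= 2) {
        double f8 = (double)(n + 7) * (n + 7);  /* 8-tap (HEVC)      */
        double f6 = (double)(n + 5) * (n + 5);  /* 6-tap (H.264/AVC) */
        printf("N=%2d: fetch overhead %3.0f%% (8-tap) vs %3.0f%% (6-tap)\n",
               n, 100.0 * (f8 / (n * n) - 1.0),
               100.0 * (f6 / (n * n) - 1.0));
    }
    return 0;
}
```

The growing gap between the two filters at small block sizes is what motivates the high-throughput MC cache mentioned below.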
The architecture of the video decoder depends strongly on parameters such as the required throughput (i.e., the pixel rate defined by the level limit in the HEVC specification), the technology node, the area and power budgets, the control and data interfaces to the external world, and the memory technology used for the decoded picture buffer. In this chapter, we describe the architecture of an HEVC decoder for 4K Ultra HD decoding at 30 fps, designed in 40 nm CMOS technology with external DDR3 memory for the decoded picture buffer. The decoder operates at 200 MHz and is frequency-scalable for lower resolutions and picture rates; a rough cycle-budget estimate for this operating point is sketched after the list below. Along with techniques used in H.264/AVC decoders, such as frame-level parallelism [29] and reference frame compression [20], and general VLSI techniques such as pipelining and dynamic voltage and frequency scaling, HEVC decoders can benefit from architectural techniques such as:
• Variable-size pipelining to reduce on-chip SRAM and handle different CTU sizes.
• Unified processing engines for prediction and transform to manage the large diversity of PU and TU sizes.
• High-throughput motion compensation (MC) cache to address increased DRAM requirements for the longer interpolation filters.
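As a concrete illustration of the throughput requirement, the sketch below estimates the luma pixel rate and the per-CTU cycle budget at the stated operating point (4K Ultra HD at 30 fps, 200 MHz). The 3840×2160 frame size and 64×64 CTU size are assumptions used only for this estimate.

```c
#include <stdio.h>

/* Rough cycle-budget estimate for 4K UHD at 30 fps with a 200 MHz clock.
 * Assumed for illustration: 3840x2160 frames and 64x64 CTUs. */
int main(void) {
    const long width = 3840, height = 2160, fps = 30;
    const long clock_hz = 200000000L;

    long luma_rate = width * height * fps;              /* ~249 Mpixel/s */
    double pixels_per_cycle = (double)luma_rate / clock_hz;

    long ctus_per_frame = ((width + 63) / 64) * ((height + 63) / 64);
    double cycles_per_ctu = (double)clock_hz / (ctus_per_frame * fps);

    printf("luma pixel rate      : %ld pixels/s\n", luma_rate);
    printf("luma pixels per cycle: %.2f\n", pixels_per_cycle);
    printf("cycle budget per CTU : %.0f cycles\n", cycles_per_ctu);
    return 0;
}
```

At roughly 1.2 luma pixels per cycle, every 64×64 CTU must leave the pipeline within a few thousand cycles on average, which is the budget the techniques listed above have to meet.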
10.2 System Pipeline
The granularity of the top-level pipeline is affected by processing dependencies
between pixels. For example, computing the luma residue at any pixel location
requires all transform coefficients in the TU that contains the pixel. Hence, it is
not possible for the inverse transform block to use, say, a 4×4 pixel pipeline; the
pipeline granularity must be at least one TU in size. In general, it is desirable to
minimize the pipeline granularity to reduce processing latency and memory sizes.
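The TU-level dependency can be seen directly from a brute-force formulation of the separable 2D inverse transform: every residue sample is a weighted sum over all coefficients in the TU. The sketch below uses a placeholder basis matrix T rather than the actual HEVC core transform, and the output scaling is arbitrary.

```c
#include <stdint.h>

#define N 32                    /* largest TU size */

static int32_t T[N][N];         /* placeholder transform basis, not the
                                   HEVC core transform matrix */

/* Brute-force 2D inverse transform of one NxN TU. */
void inverse_transform_tu(const int16_t coeff[N][N], int16_t residue[N][N])
{
    for (int y = 0; y < N; y++) {
        for (int x = 0; x < N; x++) {
            int64_t acc = 0;
            /* Every residue sample depends on ALL N*N coefficients, so no
             * sample can be produced before the complete TU has arrived at
             * this pipeline stage. */
            for (int u = 0; u < N; u++)
                for (int v = 0; v < N; v++)
                    acc += (int64_t)T[u][y] * T[v][x] * coeff[u][v];
            residue[y][x] = (int16_t)(acc >> 14);  /* illustrative scaling */
        }
    }
}
```

A hardware implementation would of course use the separable row/column decomposition with a transpose memory in between, but the data dependency is the same: the stage consumes whole TUs, not fixed 4×4 blocks.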
The largest CTU needs 6 kB to store its luma and chroma pixels at 8-bit precision. The transform coefficients and residue are computed at higher precision (16-bit and 9-bit, respectively) and require correspondingly larger storage. Other information, such as intra-prediction modes and inter-prediction motion vectors, needs to be stored at a 4×4 granularity. All of these require large pipeline buffers in SRAM, and several techniques described in this chapter can be used to reduce their size.
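A rough sizing of these pipeline buffers, assuming a 64×64 CTU with 4:2:0 chroma (consistent with the 6 kB figure above), is sketched below; the tight packing of 9-bit residue samples is an assumption for illustration.

```c
#include <stdio.h>

/* Pipeline-buffer sizing for the largest CTU.
 * Assumed for illustration: 64x64 CTU, 4:2:0 chroma. */
int main(void) {
    const int luma    = 64 * 64;           /* 4096 luma samples   */
    const int chroma  = 2 * 32 * 32;       /* 2048 chroma samples */
    const int samples = luma + chroma;     /* 6144 samples        */

    printf("8-bit pixels       : %d bytes (~6 kB)\n",  samples * 8  / 8);
    printf("16-bit coefficients: %d bytes (~12 kB)\n", samples * 16 / 8);
    printf("9-bit residue      : %d bytes (if tightly packed)\n",
           samples * 9 / 8);

    /* Side information (intra mode, motion vectors, ...) is stored per
     * 4x4 block. */
    printf("4x4 blocks per CTU : %d\n", (64 / 4) * (64 / 4));   /* 256 */
    return 0;
}
```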
Line buffers are required to handle data dependencies between CTUs in the
vertical direction. For example, the deblocking filter needs to store four rows of