Decoder Hardware Architecture for HEVC - High Efficiency Video Coding (HEVC)

(IDST). As compared to H.264/AVC, the HEVC inverse transform involves

significant challenges for hardware implementation. This is the result of the following

factors:

1. HEVC uses Transform Units (TUs) of size 4×4, 8×8, 16×16, and 32×32

pixels. This variety of TU sizes complicates the design of control logic as TUs of

different sizes take different numbers of cycles to process.

2. Like H.264/AVC, the 2-D transforms in HEVC are separable into 1-D transforms

along the columns and rows. An N×N 2-D transform consists of N 1-D column

transforms and N 1-D row transforms, each of which can be viewed as the product

of an N×N transform matrix with N×1 input coefficients. The total number

of multiplications is thus 2N³, or 2N per coefficient. Hence, the largest IDCT

in HEVC (32×32) takes 4× the number of multiplications per coefficient as

compared to the largest IDCT in H.264/AVC (8×8). Furthermore, the increased

precision in HEVC transforms doubles the cost of each multiplication. Hence,

HEVC transform logic has 8× the computational complexity of H.264/AVC.

3. An intermediate memory is needed to store the TU between the column and row

transform operations. This memory must perform a transposition (i.e., columns

are written to it and rows are read out). Previous designs for H.264/AVC used

register arrays due to the small TU sizes. These do not scale very well to

the higher TU sizes of HEVC and one must look to denser memories such as

SRAM to achieve an area-efficient implementation. However, the higher density

of SRAMs comes at the cost of lower memory throughput and less flexibility in

read-write patterns.
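
The separability and transposition described in items 2 and 3 can be illustrated with a short software model. This is only a sketch under stated assumptions: it uses a floating-point orthonormal DCT matrix as a stand-in for HEVC's integer transform matrices, and the function names (dct_matrix, matvec, inverse_transform_2d) are hypothetical, chosen for this example. It runs the N column transforms, transposes the intermediate block, runs the N row transforms, and counts multiplications along the way.

```python
import math


def dct_matrix(n):
    """Orthonormal n x n DCT-II matrix (floating-point stand-in, an assumption;
    HEVC itself specifies integer matrices)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def transpose(block):
    """Swap rows and columns; this models the transpose memory."""
    return [list(col) for col in zip(*block)]


def matvec(matrix, vector, counter):
    """N x N matrix times N x 1 vector, counting every multiplication."""
    out = []
    for row in matrix:
        acc = 0.0
        for m, v in zip(row, vector):
            acc += m * v
            counter[0] += 1
        out.append(acc)
    return out


def inverse_transform_2d(coeffs):
    """Separable 2-D inverse transform: N column passes, transpose, N row passes.
    Returns the reconstructed block and the number of multiplications used."""
    n = len(coeffs)
    inv = transpose(dct_matrix(n))  # inverse of an orthonormal matrix
    counter = [0]
    columns = transpose(coeffs)                        # columns[j] is column j
    transformed_columns = [matvec(inv, col, counter) for col in columns]
    rows = transpose(transformed_columns)              # written as columns, read as rows
    pixels = [matvec(inv, row, counter) for row in rows]
    return pixels, counter[0]


for n in (8, 32):
    block = [[0.0] * n for _ in range(n)]
    block[0][0] = float(n)  # DC-only input reconstructs a flat block
    pixels, mults = inverse_transform_2d(block)
    print(n, mults, mults // (n * n))
```

For N = 8 the model performs 1024 multiplications (16 per coefficient) and for N = 32 it performs 65536 (64 per coefficient), matching the 2N³ total and the 4× per-coefficient ratio stated in item 2.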

A single-cycle 32-pt 1-D IDCT with Booth-encoded shift-and-add multipliers

takes about 145 kgate of logic. For comparison, a complete 1080p H.264/AVC

decoder can be built in 160 kgate [11]. Hence, aggressive optimizations that exploit

various properties of the transform matrix are necessary to achieve a reasonable

area. Also, a single-cycle 32-pt IDCT provides much higher throughput than what

is required for real-time operation. It is possible to reduce the area by computing

the IDCT over multiple cycles using partial matrix multiplication. A 2 pixel/cycle

throughput at 200 MHz is sufficient for 4K Ultra HD decode at 30 fps. The following

subsections describe such a design.
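
The throughput claim and the multi-cycle idea above can be checked with a small model. This is a rough sketch, not the design described in the following subsections. It assumes 4K Ultra HD means 3840×2160 with 4:2:0 chroma sampling (1.5 samples per luma pixel), which the text does not state. It also assumes one possible reading of partial matrix multiplication, in which each cycle evaluates only a few rows of the matrix-vector product. The function names are hypothetical.

```python
def required_sample_rate(width, height, fps, samples_per_pixel=1.5):
    """Samples per second the transform must sustain.
    samples_per_pixel=1.5 assumes 4:2:0 chroma (an assumption, not from the text)."""
    return width * height * fps * samples_per_pixel


def matvec_multicycle(matrix, vector, outputs_per_cycle=2):
    """Compute matrix * vector a few output rows at a time.
    Each loop iteration models one clock cycle that evaluates only
    outputs_per_cycle rows of the full product."""
    outputs = []
    cycles = 0
    for start in range(0, len(matrix), outputs_per_cycle):
        for row in matrix[start:start + outputs_per_cycle]:
            outputs.append(sum(m * v for m, v in zip(row, vector)))
        cycles += 1
    return outputs, cycles


needed = required_sample_rate(3840, 2160, 30)   # assumed 4K UHD resolution
available = 2 * 200_000_000                     # 2 pixel/cycle at 200 MHz
print(needed, available, available >= needed)

# A 32-point transform at 2 outputs per cycle takes 16 cycles instead of 1.
toy_matrix = [[(r + 1) * (c + 2) % 7 for c in range(32)] for r in range(32)]
toy_input = list(range(32))
result, cycles = matvec_multicycle(toy_matrix, toy_input)
print(cycles)
```

Under these assumptions the required rate is about 373 Msample/s against an available 400 Msample/s. The multi-cycle version needs only the multipliers for two output rows per cycle, which is where the area saving comes from.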

10.4.1 Top-Level Pipelining

In general, two high-level architectures are possible for a 2 pixel/cycle inverse

transform [4]. The first one, shown in Fig. 10.5a, uses separate stages for row

and column transforms. Each one has a throughput of 2 pixel/cycle and operates

concurrently. The dependency between the row and column transforms (all columns

of the TU must be processed before the row transform) means that the two stages

must process different TUs at the same time. The transpose memory must have one
