(IDST). As compared to H.264/AVC, the HEVC inverse transform involves significant
challenges for hardware implementation. This is the result of the following
factors:
1. HEVC uses Transform Units (TUs) of size 4×4, 8×8, 16×16, and 32×32
pixels. This variety of TU sizes complicates the design of control logic, as TUs of
different sizes take different numbers of cycles to process.
2. Like H.264/AVC, the 2-D transforms in HEVC are separable into 1-D transforms
along the columns and rows. An N×N 2-D transform consists of N 1-D column
transforms and N 1-D row transforms, each of which can be viewed as the product
of an N×N transform matrix with N×1 input coefficients. The total number
of multiplications is thus 2N³, or 2N per coefficient. Hence, the largest IDCT
in HEVC (32×32) takes 4× the number of multiplications per coefficient as
compared to the largest IDCT in H.264/AVC (8×8). Furthermore, the increased
precision in HEVC transforms doubles the cost of each multiplication. Hence,
HEVC transform logic has 8× the computational complexity of H.264/AVC.
3. An intermediate memory is needed to store the TU between the column and row
transform operations. This memory must perform a transposition (i.e., columns
are written to it and rows are read out). Previous designs for H.264/AVC used
register arrays due to the small TU sizes. These do not scale well to
the larger TU sizes of HEVC, and one must look to denser memories such as
SRAM to achieve an area-efficient implementation. However, the higher density
of SRAMs comes at the cost of lower memory throughput and less flexibility in
read-write patterns.
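
The multiplication count in item 2 can be checked with a small sketch. It applies N 1-D column transforms followed by N 1-D row transforms and counts every multiplication. The function names and the placeholder matrix are illustrative assumptions; the matrix is not the HEVC coefficient set, and the code is a software model rather than a hardware design.

```python
def transpose(block):
    """Swap rows and columns of a square block stored as a list of rows."""
    return [list(col) for col in zip(*block)]


def matvec(matrix, vector, counter):
    """Multiply an N x N matrix by an N x 1 vector, counting multiplications."""
    n = len(vector)
    out = []
    for row in matrix:
        acc = 0
        for k in range(n):
            acc += row[k] * vector[k]
            counter[0] += 1  # one multiplication per matrix entry used
        out.append(acc)
    return out


def separable_2d(matrix, block):
    """Compute matrix * block * matrix^T as column transforms, then row transforms.

    Returns (result, multiplication_count).
    """
    counter = [0]
    # Column stage: N 1-D transforms, one per column of the input block.
    columns = [matvec(matrix, col, counter) for col in transpose(block)]
    # Transposition step: the column results are read back out as rows.
    intermediate = transpose(columns)
    # Row stage: N 1-D transforms, one per row of the intermediate block.
    rows = [matvec(matrix, row, counter) for row in intermediate]
    return rows, counter[0]


def placeholder_matrix(n):
    """Hypothetical integer matrix for counting only; NOT the HEVC coefficients."""
    return [[(i * n + j) % 7 - 3 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        block = [[(r + 2 * c) % 5 for c in range(n)] for r in range(n)]
        _, mults = separable_2d(placeholder_matrix(n), block)
        # Total is 2 * N**3, i.e. 2 * N multiplications per coefficient.
        print(n, mults, mults // (n * n))
```

For N = 32 the per-coefficient count is 64, against 16 for N = 8, which is the 4× ratio quoted in item 2. The explicit transpose between the two stages plays the role of the intermediate memory described in item 3.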
A single-cycle 32-pt 1-D IDCT with Booth-encoded shift-and-add multipliers
takes about 145 kgate of logic. For comparison, a complete 1080p H.264/AVC
decoder can be built in 160 kgate [11]. Hence, aggressive optimizations that exploit
various properties of the transform matrix are necessary to achieve a reasonable
area. Also, a single-cycle 32-pt IDCT provides much higher throughput than what
is required for real-time operation. It is possible to reduce the area by computing
the IDCT over multiple cycles using partial matrix multiplication. A 2 pixel/cycle
throughput at 200 MHz is sufficient for 4K Ultra HD decode at 30 fps. The following
subsections describe such a design.
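
The throughput claim above can be sanity-checked with simple arithmetic. The sketch below assumes a 3840×2160 frame and, for the second figure, 4:2:0 chroma subsampling (1.5 samples per pixel). Neither assumption is stated in the text, and the function names are illustrative.

```python
# Assumed frame parameters for "4K Ultra HD at 30 fps" (not stated in the text).
WIDTH, HEIGHT, FPS = 3840, 2160, 30

# Figures taken from the text.
CLOCK_HZ = 200_000_000
PIXELS_PER_CYCLE = 2


def luma_rate(width, height, fps):
    """Luma samples that must be produced per second."""
    return width * height * fps


def rate_with_420_chroma(width, height, fps):
    """Luma plus chroma samples per second, assuming 4:2:0 subsampling.

    Two chroma planes at quarter resolution add 50% to the luma count.
    """
    return luma_rate(width, height, fps) * 3 // 2


def available_rate(clock_hz, pixels_per_cycle):
    """Samples per second the transform can deliver at the given clock."""
    return clock_hz * pixels_per_cycle


if __name__ == "__main__":
    need_luma = luma_rate(WIDTH, HEIGHT, FPS)
    need_all = rate_with_420_chroma(WIDTH, HEIGHT, FPS)
    have = available_rate(CLOCK_HZ, PIXELS_PER_CYCLE)
    print("luma only      :", need_luma)
    print("luma + chroma  :", need_all)
    print("available      :", have)
    print("sufficient     :", need_all <= have)
```

Under these assumptions, the required rate stays below the 400 million samples per second that 2 pixel/cycle at 200 MHz provides, even when chroma is counted.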
10.4.1 Top-Level Pipelining
In general, two high-level architectures are possible for a 2 pixel/cycle inverse
transform [4]. The first one, shown in Fig. 10.5a, uses separate stages for row
and column transforms. Each one has a throughput of 2 pixel/cycle and operates
concurrently. The dependency between the row and column transforms (all columns
of the TU must be processed before the row transform) means that the two stages
must process different TUs at the same time. The transpose memory must have one