Encoder Hardware Architecture for HEVC - High Efficiency Video Coding (HEVC)

difference between the current PU and the matching PU in the reference frames

constitutes the prediction error (a.k.a. residue) which is transformed and quantized

and coded in the bitstream.

In comparison with H.264/AVC, inter prediction in HEVC has three major

differences: (1) a larger diversity of block sizes, (2) a high-complexity mode decision

needed to achieve sufficient coding gain, and (3) a longer sub-pixel interpolation

filter. In HEVC, the PU size may range from 4×8/8×4 to 64×64. The computational

complexity for deciding the best block partition also increases considerably. To

accurately choose the best mode among such a high number of possible modes, full

RDO invoking more accurate distortion and bit estimation needs to be applied.

This requires inter prediction to preserve several candidate modes for the later HCMD

stage. HEVC uses an 8-tap or 7-tap interpolation filter for higher interpolation accuracy

compared with the 6-tap filter in H.264/AVC, so the complexity of sub-pixel calculation is

also higher. To cope with these complexity increases, higher parallelism in hardware

is necessary. This should be achieved with a moderate cost increase. In addition,

hardware parallelism also demands much higher memory access bandwidth.

A memory subsystem that can sustain this bandwidth is required to make

motion estimation work properly. These issues are covered later in this section.
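
As a rough illustration of the longer filters mentioned above, the sketch below applies the HEVC luma half-sample (8-tap) and quarter-sample (7-tap) coefficients to one row of 8-bit pixels. The function name, the single-pass 1-D rounding, and the 8-bit clipping are simplifying assumptions made for this example; a real codec filters separably in two dimensions and keeps higher intermediate precision.

```python
# HEVC luma interpolation coefficients; each set sums to 64.
HALF_PEL_8TAP = (-1, 4, -11, 40, 40, -11, 4, -1)
QUARTER_PEL_7TAP = (-1, 4, -10, 58, 17, -5, 1)


def interpolate_luma(samples, taps, i):
    """Interpolate a fractional sample between samples[i] and samples[i + 1].

    Hypothetical helper: both filters start three samples to the left of
    position i, so the caller must guarantee i - 3 >= 0 and that
    i - 3 + len(taps) does not run past the end of the row.
    """
    acc = sum(c * samples[i - 3 + k] for k, c in enumerate(taps))
    value = (acc + 32) >> 6          # divide by 64 with rounding
    return max(0, min(255, value))   # clip to the 8-bit range (assumption)


if __name__ == "__main__":
    row = [10 * n for n in range(10)]  # simple linear ramp
    print(interpolate_luma(row, HALF_PEL_8TAP, 4))
    print(interpolate_luma(row, QUARTER_PEL_7TAP, 4))
```

Each output sample costs seven or eight multiply-accumulates instead of the six needed by the H.264/AVC filter, which is one source of the added sub-pixel complexity.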

11.3.1 Motion Estimation

Because of differences in the nature of the processing, inter prediction is usually divided

into two major modules, integer motion estimation (IME) and fractional motion

estimation (FME), corresponding to two granularity levels, the integer level and the

fractional level. IME usually performs a coarse search over the whole search region.

At this level, the parallelism requirement is high, while the accuracy requirement

is moderate. After that, FME performs a fine search around the IME result

with sub-pixel accuracy. An 8-tap or 7-tap interpolation filter is required to obtain the pixels

in the fractional positions. Since the distortion costs among neighboring sub-pixel

candidates are similar, higher accuracy in the distortion computation is required in

order to select the best candidate. The reference architecture is shown in Fig. 11.2.
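
The integer stage described above can be sketched as an exhaustive SAD search over a square window. This is only an illustrative software model, not the architecture of Fig. 11.2: the function names, the list-of-rows frame layout, and the plain SAD cost (with no motion vector rate term) are assumptions made for the example.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute differences between the size x size block of `cur`
    at (bx, by) and the block of `ref` at (rx, ry)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[ry + y][rx + x])
        for y in range(size)
        for x in range(size)
    )


def integer_full_search(cur, ref, bx, by, size, search_range):
    """Coarse integer search: test every displacement within
    +/- search_range and return (best_mv, best_cost)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

Every candidate in the window is independent of the others, which is why this stage parallelizes well in hardware. A fractional stage would then evaluate interpolated candidates around the returned vector, typically with a more accurate distortion measure than SAD.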

In previous works, various architectures for variable block size motion estimation

have been compared [7, 24]. A fast gradient-based algorithm on a parallel 2D

SAD tree with high data reuse is described in [10]. Exploration in data reuse

for motion estimation is shown in [13]. To increase parallelism, a highly parallel

inter mode decision in HEVC is achieved by dependency removal in [41]. Finally,

[35] describes how throughput requirements can be met by processing multiple

CUs in parallel, but processing the PUs within each CU serially to achieve the

same sequential order as in HM. The results show that small block sizes (e.g. 4×4,

4×8, 8×4) impose a significantly larger hardware cost but provide only modest

improvements in coding efficiency. In addition, a search range strategy centered on

the advanced motion vector predictors (AMVP) with pre-fetch and limited search

range movement is presented.
