Encoder Hardware Architecture for HEVC - High Efficiency Video Coding (HEVC)

Table 11.1 The comparison of different data reuse schemes for ME [8]

Reuse scheme    EMB (pixels/pixel)    On-chip memory size (pixels)

Level C         1 + SR_V/N            (SR_H + N - 1)(SR_V + N - 1)

Level C+        1 + SR_V/(nN)         (SR_H + mN - 1)(SR_V + nN - 1)

Level D         1                     (SR_H + W - 1)(SR_V - 1)

EMB: external memory bandwidth of the reference frame

SR_H: horizontal search range; SR_V: vertical search range

N: current CTU size; n: zigzag stitch number; W: frame width

redundant access is saved, while the SRAM buffer will also be larger. By varying n, the Level C+ scheme provides a continuous trade-off between SRAM size and bandwidth and can adapt to the design requirements. Table 11.1 compares the Level C, Level C+, and Level D schemes. Note that a cache-based scheme may also be applied to the raster scan order or the zigzag scan order in Level C+ to save more SRAM.
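As a rough illustration of this trade-off, the entries of Table 11.1 can be evaluated numerically. The sketch below assumes the table reads as EMB = 1 + SR_V/N and memory = (SR_H + N - 1)(SR_V + N - 1) for Level C, EMB = 1 + SR_V/(nN) and memory = (SR_H + mN - 1)(SR_V + nN - 1) for Level C+, and EMB = 1 and memory = (SR_H + W - 1)(SR_V - 1) for Level D. The function names, the default m = 1, and the parameter values (a 256-pixel search range, a 64-pixel CTU, a 7680-pixel frame width) are illustrative assumptions, not figures taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseCost:
    emb: float        # reference pixels fetched from external memory per current pixel
    sram_pixels: int  # on-chip search-window buffer size, in pixels


def level_c(sr_h: int, sr_v: int, n_ctu: int) -> ReuseCost:
    """Level C: the search window is reused between horizontally adjacent CTUs."""
    return ReuseCost(1 + sr_v / n_ctu, (sr_h + n_ctu - 1) * (sr_v + n_ctu - 1))


def level_c_plus(sr_h: int, sr_v: int, n_ctu: int, n: int, m: int = 1) -> ReuseCost:
    """Level C+: n is the zigzag stitch number; m = 1 is an assumed default."""
    return ReuseCost(
        1 + sr_v / (n * n_ctu),
        (sr_h + m * n_ctu - 1) * (sr_v + n * n_ctu - 1),
    )


def level_d(sr_h: int, sr_v: int, frame_width: int) -> ReuseCost:
    """Level D: every reference pixel is fetched from external memory once."""
    return ReuseCost(1.0, (sr_h + frame_width - 1) * (sr_v - 1))


if __name__ == "__main__":
    SR_H = SR_V = 256  # assumed search range, in pixels
    N = 64             # assumed CTU size
    W = 7680           # assumed frame width

    print("Level C      ", level_c(SR_H, SR_V, N))
    for n in (2, 4):
        print(f"Level C+ n={n}", level_c_plus(SR_H, SR_V, N, n))
    print("Level D      ", level_d(SR_H, SR_V, W))
```

Under these assumed values, raising n from 1 (plain Level C) to 4 lowers the EMB factor from 5 to 2 while the buffer grows from about 0.10 to 0.16 Mpixels. Level D reaches a factor of 1 at the cost of a buffer of roughly 2 Mpixels.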

To illustrate a practical case, we take 8K UHDTV as an example. Assume motion estimation supports a [-128, +127] search range and a total of two reference frames. If we use the Level C data reuse strategy, the bandwidth requirement will be as high as 11.61 GB/s for reference memory access. The Level C strategy discards the reference pixels that fall outside the search range of the current CTU and reloads them later while processing CTUs in the next CTU row, and thus still consumes high bandwidth for ultra-high-resolution sequences. The Level C+ scheme and the Level D scheme require less bandwidth. For the Level D scheme, the bandwidth can be reduced to 2.97 GB/s.

From a module-level view, inter prediction in HEVC is very complex and requires a high degree of parallelism. This also imposes significant internal on-chip memory bandwidth and multi-port access requirements. If high-complexity mode decision is used, there will be multiple refinement levels of CU depth that need to be searched in the fractional motion estimation (FME) stage. For each CU depth, merge candidates need to be searched, and additional memory ports are required. In addition, if interleaving interpolation is used in the FME stage, 2× the number of ports is required per refinement level. As a result, the required number of ports will be much higher than in previous H.264/AVC encoders. For instance, a total of 13 ports are required if IME and three-level refinement of FME operate in parallel.

Considering both the system-level and module-level views, a Level C+ or Level D search window memory with a high number of ports is required. However, a large, highly ported memory is costly. SRAM banking is an alternative technique for increasing access parallelism. The case for SRAM banking is shown in Fig. 11.8. Each column is put in a separate SRAM bank, and each row corresponds to a separate SRAM address. For example, A1 is put at address #1 in bank #A, and C4 is put at address #4 in bank #C. With banking, we may access any combination that does not have a bank conflict. For example, {B3, C4, D5, E6} can be read out in one cycle without conflict, while {D5, E5, D6, E6} cannot, since {D5, D6} use the same bank D and {E5, E6} use the same bank E. Each bank may serve only one read address per cycle. Thus, the read operation must
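The bank-conflict rule described for Fig. 11.8 can be sketched in a few lines. This is a minimal illustration, assuming each cell label is one bank letter followed by an address number (as in "C4") and that a bank serves exactly one address per cycle. The helper names are hypothetical, and the cycle count is only a lower bound under that assumption, not a description of the actual read scheduler.

```python
def parse_cell(cell: str) -> tuple[str, int]:
    """Split a label such as 'C4' into (bank, address)."""
    return cell[0], int(cell[1:])


def addresses_per_bank(cells: list[str]) -> dict[str, set[int]]:
    """Group the requested addresses by the bank that holds them."""
    grouped: dict[str, set[int]] = {}
    for cell in cells:
        bank, addr = parse_cell(cell)
        grouped.setdefault(bank, set()).add(addr)
    return grouped


def conflicting_banks(cells: list[str]) -> list[str]:
    """Banks asked for more than one distinct address in the same cycle."""
    grouped = addresses_per_bank(cells)
    return sorted(bank for bank, addrs in grouped.items() if len(addrs) > 1)


def cycles_needed(cells: list[str]) -> int:
    """Lower bound on read cycles: the busiest bank serves one address per cycle."""
    grouped = addresses_per_bank(cells)
    return max((len(addrs) for addrs in grouped.values()), default=0)


if __name__ == "__main__":
    for request in (["B3", "C4", "D5", "E6"], ["D5", "E5", "D6", "E6"]):
        print(request, conflicting_banks(request), cycles_needed(request))
```

For the two access patterns from the text, the first touches four different banks and fits in one cycle, while the second asks banks D and E for two addresses each and therefore needs at least two cycles.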
