Fig. 11.10 Data granularity in reference frame access (search window for the current CU and search range for later use held in one large L3 SRAM; small per-module L1/L2 SRAMs hold only the used search ranges that feed the fast ME search engines)
bandwidth and internal bandwidth. The data characteristics at various levels of data granularity are illustrated in Fig. 11.10. For a given CU, IME and FME based on fast algorithms may not access the whole search window memory; only small portions of the search range are actually touched. At the module level, IME and FME never require data outside the real search region. Since a larger memory costs more area and more power, it is not efficient to store the whole search range for IME and FME use. For this reason, we use a multi-level reference memory hierarchy with each level sized for the best efficiency. A large L3 reference SRAM enables level C+/level D style buffering for the lowest bandwidth overhead: with deep level C+ or level D reuse, each reference pixel is read from external memory only once per frame.
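As a rough, hypothetical illustration of why frame-level reuse matters, the sketch below compares the external-memory traffic of a naive per-CU search-window fetch against the one-read-per-pixel traffic implied by level C+/level D style buffering. The frame size, frame rate, CU size, and search range used here are assumptions for illustration, not values from this design.

```python
# Rough bandwidth comparison: naive per-CU search-window fetch vs.
# level C+/level D style reuse (each reference pixel read once per frame).
# All parameters below are illustrative assumptions.

FRAME_W, FRAME_H = 3840, 2160   # assumed 4K frame
FPS = 30                        # assumed frame rate
CU = 64                         # assumed CU/CTU size
SR_H, SR_V = 128, 64            # assumed +/- horizontal/vertical search range

# Naive: every CU fetches its full search window from external memory.
win_w = 2 * SR_H + CU
win_h = 2 * SR_V + CU
cus_per_frame = (FRAME_W // CU) * (FRAME_H // CU)
naive_bytes_per_s = win_w * win_h * cus_per_frame * FPS   # 1 byte/pixel, luma only

# Level C+/level D style reuse: one read per reference pixel per frame.
reuse_bytes_per_s = FRAME_W * FRAME_H * FPS

print(f"naive per-CU fetch : {naive_bytes_per_s / 1e9:.2f} GB/s")
print(f"frame-level reuse  : {reuse_bytes_per_s / 1e9:.2f} GB/s")
print(f"traffic reduction  : {naive_bytes_per_s / reuse_bytes_per_s:.1f}x")
```

Under these assumed numbers the per-CU fetch is more than an order of magnitude higher than the once-per-frame traffic, which is the motivation for the large shared L3 buffer.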
To support highly concurrent access on the memory ports at the module level, we use L2 and L1 SRAMs. The ME reference prefetch unit fills the L2 SRAM for IME use and the L2 buffer that feeds the FME reference broadcasting unit. Because IME operates on subsampled pixels, its SRAM stores the reference pixels in the subsampled pattern and is filled in subsampling order. The FME reference broadcasting unit fills the fully sampled L1 SRAMs with data from the L2 buffer. With this scheme, all concurrent access requirements of IME and FME are satisfied while the memory bandwidth is kept to a minimum.
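The following minimal sketch, assuming a simple 2:1 subsampling pattern and toy buffer shapes, illustrates how a subsampled reference buffer for IME and a fully sampled L1 copy for FME might be filled from the same L2 data; the function names and dimensions are hypothetical.

```python
# Sketch of filling the module-level reference buffers from L2 data.
# The 2:1 horizontal/vertical subsampling pattern, buffer shapes, and
# function names are assumptions for illustration only.

def fill_ime_reference(l2_block):
    """IME works on subsampled pixels, so its reference SRAM keeps every
    other sample in both dimensions (assumed 2:1 subsampling)."""
    return [row[::2] for row in l2_block[::2]]

def fill_fme_l1(l2_block):
    """FME needs full-resolution pixels, so its L1 SRAM holds a fully
    sampled copy of the requested search-range block."""
    return [list(row) for row in l2_block]

# Toy 8x8 reference block standing in for one L2 read burst.
l2_block = [[y * 16 + x for x in range(8)] for y in range(8)]

ime_ref = fill_ime_reference(l2_block)   # 4x4 subsampled pattern for IME
fme_l1 = fill_fme_l1(l2_block)           # 8x8 fully sampled pattern for FME
print(len(ime_ref), len(ime_ref[0]))     # -> 4 4
print(len(fme_l1), len(fme_l1[0]))       # -> 8 8
```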
In addition, the architecture can be scaled up if more read ports are required. The total SRAM size needed as the number of read ports increases is shown in Fig. 11.11. As an example, assume one set of IME engines and four sets of FME engines are used to meet certain design requirements. In this case, the reference memory hierarchy needs to support a total of 17 ports. This can be achieved simply by using 16 L1 SRAMs with the fully sampled pattern and one L1 SRAM with the subsampled pattern. As shown in Fig. 11.11, if the number of ports increases by 30% (i.e., from 13 to 17 ports), the additional reference memory needed to support the four extra ports is only 2.7% (from 7.14 to 7.33 MB). This architecture therefore offers high read-port scalability.
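The scaling argument can be summarized in a short sketch: only the small per-port L1 banks are replicated, while the large shared memory is not. The bank sizes below are assumptions back-derived from the 7.14 MB and 7.33 MB totals quoted above rather than values stated explicitly in the text.

```python
# Sketch of the read-port scaling argument: adding FME read ports only
# replicates small L1 banks; the large shared L3/L2 memory is unchanged.
# Sizes are assumptions back-derived from the quoted 7.14/7.33 MB totals.

SHARED_MB = 6.57        # assumed shared L3 + L2 + subsampled reference SRAM
L1_FULL_MB = 0.0475     # assumed size of one fully sampled FME L1 bank

def total_reference_memory(num_fme_ports):
    """Total reference memory for 1 IME port plus num_fme_ports FME ports."""
    return SHARED_MB + num_fme_ports * L1_FULL_MB

for fme_ports in (12, 16):          # 13 and 17 total ports (1 IME + N FME)
    print(f"{1 + fme_ports:2d} ports: {total_reference_memory(fme_ports):.2f} MB")

growth = total_reference_memory(16) / total_reference_memory(12) - 1
print(f"memory growth for 4 extra ports: {growth * 100:.1f} %")   # ~2.7 %
```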
 