Fig. 11.10 Data granularity in reference frame access (search window for the current CU and search range for later use held in one large L3 SRAM; small per-module L1/L2 SRAMs hold only the used search ranges that feed the fast ME search engines)
bandwidth and internal bandwidth. The data characteristics at various levels of data granularity are illustrated in Fig. 11.10. For a given CU, IME and FME based on fast algorithms may not access the whole search window memory; only small portions of the search range are actually touched. At the module level, IME and FME never require data outside the real search region. Since a larger memory costs more area and more power, it is not efficient to store the whole search range for IME and FME use. For this reason, we use a multi-level reference memory hierarchy with each level sized for the best efficiency. A large L3 reference SRAM enables level C+/level D style buffering for the lowest bandwidth overhead: with deep level C+ or level D reuse, each reference pixel is read from external memory only once per frame.
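As a rough, hypothetical illustration of why frame-level reuse matters, the sketch below compares the external-memory traffic of a naive per-CU search-window fetch against the one-read-per-pixel traffic implied by level C+/level D style buffering. The frame size, frame rate, CU size, and search range used here are assumptions for illustration, not values from this design.

```python
# Rough bandwidth comparison: naive per-CU search-window fetch vs.
# level C+/level D style reuse (each reference pixel read once per frame).
# All parameters below are illustrative assumptions.

FRAME_W, FRAME_H = 3840, 2160   # assumed 4K frame
FPS = 30                        # assumed frame rate
CU = 64                         # assumed CU/CTU size
SR_H, SR_V = 128, 64            # assumed +/- horizontal/vertical search range

# Naive: every CU fetches its full search window from external memory.
win_w = 2 * SR_H + CU
win_h = 2 * SR_V + CU
cus_per_frame = (FRAME_W // CU) * (FRAME_H // CU)
naive_bytes_per_s = win_w * win_h * cus_per_frame * FPS   # 1 byte/pixel, luma only

# Level C+/level D style reuse: one read per reference pixel per frame.
reuse_bytes_per_s = FRAME_W * FRAME_H * FPS

print(f"naive per-CU fetch : {naive_bytes_per_s / 1e9:.2f} GB/s")
print(f"frame-level reuse  : {reuse_bytes_per_s / 1e9:.2f} GB/s")
print(f"traffic reduction  : {naive_bytes_per_s / reuse_bytes_per_s:.1f}x")
```

Under these assumed numbers the per-CU fetch is more than an order of magnitude higher than the once-per-frame traffic, which is the motivation for the large shared L3 buffer.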
To support highly concurrent access on the memory ports at the module level, we use L2 and L1 SRAMs. The ME reference prefetch unit fills the L2 SRAM for IME use and the L2 buffer that feeds the FME reference broadcasting unit. Because IME operates on subsampled pixels, its SRAM stores the reference pixels in the subsampled pattern and is filled in subsampling order. The FME reference broadcasting unit fills the fully sampled L1 SRAMs with data from the L2 buffer. With this scheme, all concurrent access requirements of IME and FME are satisfied while the memory bandwidth is kept to a minimum.
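The following minimal sketch, assuming a simple 2:1 subsampling pattern and toy buffer shapes, illustrates how a subsampled reference buffer for IME and a fully sampled L1 copy for FME might be filled from the same L2 data; the function names and dimensions are hypothetical.

```python
# Sketch of filling the module-level reference buffers from L2 data.
# The 2:1 horizontal/vertical subsampling pattern, buffer shapes, and
# function names are assumptions for illustration only.

def fill_ime_reference(l2_block):
    """IME works on subsampled pixels, so its reference SRAM keeps every
    other sample in both dimensions (assumed 2:1 subsampling)."""
    return [row[::2] for row in l2_block[::2]]

def fill_fme_l1(l2_block):
    """FME needs full-resolution pixels, so its L1 SRAM holds a fully
    sampled copy of the requested search-range block."""
    return [list(row) for row in l2_block]

# Toy 8x8 reference block standing in for one L2 read burst.
l2_block = [[y * 16 + x for x in range(8)] for y in range(8)]

ime_ref = fill_ime_reference(l2_block)   # 4x4 subsampled pattern for IME
fme_l1 = fill_fme_l1(l2_block)           # 8x8 fully sampled pattern for FME
print(len(ime_ref), len(ime_ref[0]))     # -> 4 4
print(len(fme_l1), len(fme_l1[0]))       # -> 8 8
```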
In addition, the architecture can be scaled up if more read ports are required. The total SRAM size needed as the number of read ports increases is shown in Fig. 11.11. As an example, assume one set of IME engines and four sets of FME engines are used to meet certain design requirements. In this case, the reference memory hierarchy needs to support a total of 17 ports. This can be achieved simply by using 16 L1 SRAMs with the fully sampled pattern and one L1 SRAM with the subsampled pattern. As shown in Fig. 11.11, if the number of ports increases by 30% (i.e., from 13 to 17 ports), the additional reference memory needed to support the four extra ports is only 2.7% (from 7.14 to 7.33 MB). This architecture therefore offers high read-port scalability.
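The scaling argument can be summarized in a short sketch: only the small per-port L1 banks are replicated, while the large shared memory is not. The bank sizes below are assumptions back-derived from the 7.14 MB and 7.33 MB totals quoted above rather than values stated explicitly in the text.

```python
# Sketch of the read-port scaling argument: adding FME read ports only
# replicates small L1 banks; the large shared L3/L2 memory is unchanged.
# Sizes are assumptions back-derived from the quoted 7.14/7.33 MB totals.

SHARED_MB = 6.57        # assumed shared L3 + L2 + subsampled reference SRAM
L1_FULL_MB = 0.0475     # assumed size of one fully sampled FME L1 bank

def total_reference_memory(num_fme_ports):
    """Total reference memory for 1 IME port plus num_fme_ports FME ports."""
    return SHARED_MB + num_fme_ports * L1_FULL_MB

for fme_ports in (12, 16):          # 13 and 17 total ports (1 IME + N FME)
    print(f"{1 + fme_ports:2d} ports: {total_reference_memory(fme_ports):.2f} MB")

growth = total_reference_memory(16) / total_reference_memory(12) - 1
print(f"memory growth for 4 extra ports: {growth * 100:.1f} %")   # ~2.7 %
```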
 