Table 11.1 The comparison of different data reuse schemes for ME [8]

Reuse scheme   EMB (pixels/pixel)   On-chip memory size (pixels)
Level C        1 + SR_V / N         (SR_H + N - 1) × (SR_V + N - 1)
Level C+       1 + SR_V / (nN)      (SR_H + mN - 1) × (SR_V + nN - 1)
Level D        1                    (SR_H + W - 1) × (SR_V - 1)

EMB: external memory bandwidth of reference frame
SR_H: horizontal search range; SR_V: vertical search range
N: current CTU size; n: zigzag stitch number; W: frame width
redundant access is saved, while the SRAM buffer will also be larger. By varying
n, the Level C+ scheme provides a continuous trade-off between SRAM size and bandwidth, and
can adapt to the design requirements. Table 11.1 compares the
Level C, Level C+, and Level D schemes. Note that a cache-based scheme
may also be applied to the raster scan order or the zigzag scan order of Level C+ to save
more SRAM.
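The trade-off in Table 11.1 can be explored numerically. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the example values (a 64-pixel CTU, a 256-pixel search range in each direction, a 7680-pixel frame width) are assumptions chosen only for illustration. The parameter m appears in the table's Level C+ memory expression but is not defined in the legend, so it is simply passed through.

```python
from fractions import Fraction


def level_c(sr_h, sr_v, ctu):
    """Level C: EMB in pixels/pixel and on-chip memory size in pixels."""
    emb = 1 + Fraction(sr_v, ctu)
    mem = (sr_h + ctu - 1) * (sr_v + ctu - 1)
    return emb, mem


def level_c_plus(sr_h, sr_v, ctu, n, m):
    """Level C+: n is the zigzag stitch number.

    m is taken from the table as-is; the legend does not define it.
    """
    emb = 1 + Fraction(sr_v, n * ctu)
    mem = (sr_h + m * ctu - 1) * (sr_v + n * ctu - 1)
    return emb, mem


def level_d(sr_h, sr_v, width):
    """Level D: every reference pixel is loaded exactly once."""
    emb = Fraction(1)
    mem = (sr_h + width - 1) * (sr_v - 1)
    return emb, mem


if __name__ == "__main__":
    # Assumed illustrative values, not taken from the text.
    SR_H, SR_V, CTU, WIDTH = 256, 256, 64, 7680
    print("Level C      ", level_c(SR_H, SR_V, CTU))
    for n in (2, 4):
        print(f"Level C+ n={n}", level_c_plus(SR_H, SR_V, CTU, n=n, m=1))
    print("Level D      ", level_d(SR_H, SR_V, WIDTH))
```

Under these assumed values, increasing n lowers the EMB toward the Level D value of 1 pixel/pixel while the buffer grows, which is the continuous trade-off described above.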
To illustrate a practical case, we take 8K UHDTV as an example. Assume motion
estimation supports a [-128, +127] search range and two reference frames in total. If
we use the Level C data reuse strategy, the bandwidth requirement for reference memory access will be as high
as 11.61 GB/s. The Level C strategy discards the
reference pixels that fall outside the search range of the current CTU and reloads them
later while processing CTUs in the next CTU row, and thus still consumes
high bandwidth for ultra-high-resolution sequences. The Level C+ scheme and the
Level D scheme use less bandwidth. For the Level D scheme, the bandwidth can
be reduced to 2.97 GB/s.
From a module-level view, inter prediction in HEVC is very complex and
requires a high degree of parallelism. This also imposes significant internal on-chip
memory bandwidth and multi-port access requirements. If high-complexity mode
decision is used, multiple refinement levels of CU depth need
to be searched in the fractional motion estimation (FME) stage. For each CU depth,
merge candidates need to be searched, and additional memory ports are required.
In addition, if interleaving interpolation is used in the FME stage, 2× the number of
ports is required per refinement level. As a result, the required number of ports is
much higher than in previous H.264/AVC encoders. For instance, a total of 13
ports is required if IME and three-level FME refinement operate in parallel.
Considering both the system-level and the module-level views, a Level C+ or Level D
search window memory with a high number of ports is required. However, a highly
ported, large memory is costly. SRAM banking is an alternative
technique for increasing access parallelism. An example of SRAM banking is shown
in Fig. 11.8. Each column is placed in a separate SRAM bank, and each row corresponds
to a separate SRAM address. For example, A1 is stored at address #1 in bank #A,
and C4 is stored at address #4 in bank #C. With banking, we may access
any combination that has no bank conflict. For example, {B3, C4, D5, E6}
can be read out in one cycle without conflict, while {D5, E5, D6, E6} cannot, since
{D5, D6} use the same bank D and {E5, E6} use the same bank E. Each
bank may serve only one read address per cycle. Thus, the read operation must
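
The bank-conflict rule can be checked mechanically. The sketch below is only an illustration of the one-read-address-per-bank-per-cycle constraint: the function names are hypothetical, and the cell naming (a single bank letter followed by an address number, as in "C4") is assumed from the examples in the text.

```python
from collections import Counter


def split_cell(cell):
    """Split a cell label such as 'C4' into (bank, address) = ('C', 4)."""
    return cell[0], int(cell[1:])


def conflicting_banks(cells):
    """Return the banks asked for more than one distinct address."""
    # Deduplicate first: requesting the same cell twice is a single read.
    per_bank = Counter(split_cell(c)[0] for c in set(cells))
    return sorted(bank for bank, count in per_bank.items() if count > 1)


def readable_in_one_cycle(cells):
    """True if each bank serves at most one read address."""
    return not conflicting_banks(cells)


if __name__ == "__main__":
    for request in (["B3", "C4", "D5", "E6"], ["D5", "E5", "D6", "E6"]):
        print(request, readable_in_one_cycle(request), conflicting_banks(request))
```

For the two access patterns given above, the first touches banks B, C, D, and E once each, while the second needs two addresses from bank D and two from bank E.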
 