one macroblock can contain only one kind of intra block size, which can be used to
design optimized pipeline schedules as in [7, 24]. Since a CTU in HEVC can have a
variety of TUs and a mix of intra and inter CUs, such pipeline schedules will be too
complex to optimize for every possible combination.
As a result, designing a data-flow that respects across-TU dependencies and
provides high throughput is a bigger challenge than the pixel computation involved
in reference preparation and prediction. In this chapter, we focus on the data-flow
management used in [8], which uses a hierarchical memory deployment for high
throughput and low area. The intra engine operates on blocks of 32×32 luma pixels
and two 16×16 chroma blocks, since those are the largest TU sizes. In the complete
decoder pipeline, it communicates with the entropy decoder and inverse transform at a
Variable-sized Pipeline Block (VPB) granularity. (The mapping between VPB and
CTU is shown in Table 10.1. For a 16×16 CTU, four CTUs are combined into one
intra pipeline block.)
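The grouping of CTUs into pipeline blocks can be summarized in a short sketch. It assumes, consistent with the four-CTUs-per-block case above, that a VPB always spans 64 luma pixels horizontally; the constant and function names are illustrative, not from Table 10.1 itself.

```python
# Hypothetical sketch of the VPB/CTU grouping: assume a VPB always spans
# 64 luma pixels in width, so CTUs smaller than 64x64 are grouped
# (consistent with "four 16x16 CTUs per intra pipeline block").
VPB_WIDTH = 64

def ctus_per_vpb(ctu_size):
    """Number of CTUs combined into one intra pipeline block (assumed mapping)."""
    assert ctu_size in (16, 32, 64)
    return VPB_WIDTH // ctu_size
```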
10.7.1 Hierarchical Memory Deployment
The bottom row pixels of all VPBs in a row of VPBs need to be stored since
they are top neighbors for VPBs in the row below. This buffer must be sized
proportional to the picture width and may be implemented in on-chip SRAM or
external DRAM. Storing VPB-level neighboring pixels in registers, as previous
designs for H.264/AVC have done, can provide the required high-throughput access,
but requires a lot of area since the VPB can be as large as 64×64. This issue is
addressed by storing the neighboring pixels in SRAM at the VPB level to save area,
and in registers at the TU level for high throughput. A memory hierarchy is thus
formed:
1. VPB-row-level top neighbors in SRAM or external memory
2. VPB-level neighboring pixels in SRAM
3. TU-level reference pixels in registers
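The three levels above can be modeled in a minimal sketch. The class and field names are illustrative; the only figures beyond the text are the standard HEVC reference-pixel count for an N×N TU (2N top/top-right + 2N left/below-left + 1 corner = 4N+1) and a sample picture width.

```python
# Illustrative model of the three-level reference-pixel memory hierarchy.
class IntraReferenceHierarchy:
    def __init__(self, picture_width):
        # 1. VPB-row top neighbors: one bottom row per VPB row,
        #    sized by the picture width (SRAM or external memory).
        self.vpb_row_top = [0] * picture_width
        # 2. VPB-level neighbors: a ping-pong pair of SRAMs,
        #    192 pixels each (sizing described later in the text).
        self.vpb_top = ([0] * 192, [0] * 192)
        # 3. TU-level references in registers: an NxN TU needs
        #    4N + 1 neighboring pixels; the largest TU is 32x32.
        n = 32
        self.tu_refs = [0] * (4 * n + 1)
```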
The hierarchical memory deployment is shown in Fig. 10.20 and the memory
elements are explained next:
1. VPB-Row top neighbors: In [9], this buffer is implemented in an on-chip SRAM
that is shared with the deblocking filter. The deblocking filter stores four top rows,
of which intra prediction uses one.
2. VPB top neighbors: This buffer is implemented using a pair of SRAMs in a
ping-pong fashion. One SRAM is used in the intra-prediction of the current VPB. It
is updated every TU with neighboring pixels for the next TU. At the same time,
the other SRAM updates the VPB-Row top SRAM with pixels from the previous
VPB and loads top row pixels for the next VPB. The size of each SRAM is 192
pixels (64 Y top + 32 Y top-right + 64 UV top + 32 UV top-right).
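The per-VPB ping-pong schedule and the 192-pixel sizing can be sketched as follows; the class and method names are hypothetical, but the roles of the two SRAMs follow the description above.

```python
# Each VPB-top SRAM holds 64 Y top + 32 Y top-right + 64 UV top
# + 32 UV top-right pixels.
VPB_TOP_PIXELS = 64 + 32 + 64 + 32  # = 192

class VpbTopPingPong:
    """Illustrative ping-pong control for the two VPB-level SRAMs."""
    def __init__(self):
        self.srams = [[0] * VPB_TOP_PIXELS, [0] * VPB_TOP_PIXELS]
        self.active = 0  # serves intra prediction of the current VPB

    def background(self):
        # The other SRAM writes the previous VPB's pixels back to the
        # VPB-row SRAM and prefetches top pixels for the next VPB.
        return self.srams[1 - self.active]

    def next_vpb(self):
        # Swap roles when advancing to the next VPB.
        self.active = 1 - self.active
```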