Before a CTU can be processed, its above-right neighboring CTU in the preceding
CTU row must have been finished. In other words, at any point in time the thread
processing the preceding CTU row must be at least two consecutive CTUs ahead
of the thread processing the current CTU row, which results in a "wavefront" of
CTUs rolling from the top-left to the bottom-right corner of the picture, as
illustrated in Fig. 3.12. Because of these wavefront dependencies, the threads
processing the individual CTU rows cannot all start decoding simultaneously, and
consequently they cannot all finish decoding at the same time at the end of their
rows. This introduces parallelization inefficiencies, referred to as ramping
inefficiencies, which become more pronounced as the number of threads increases.
Additional pipelining issues may arise from stalls caused by an inefficient load
balancing of CTUs. For example, a slow CTU in one CTU row can stall the
processing of the subsequent CTU rows.
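To make the dependency rule concrete, the following C++ sketch simulates a WPP schedule with one thread per CTU row. The picture dimensions and the process_ctu() function are placeholders, not decoder code; each row thread simply waits until the row above is at least two CTUs ahead (or has finished) before processing its next CTU.

#include <algorithm>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kRows = 4;
constexpr int kCtusPerRow = 8;

int done[kRows] = {};                  // CTUs finished per row
std::mutex m;
std::condition_variable cv;

void process_ctu(int row, int col) {   // stand-in for real CTU decoding
    std::printf("row %d, CTU %d\n", row, col);
}

void row_thread(int row) {
    for (int col = 0; col < kCtusPerRow; ++col) {
        if (row > 0) {
            // Wait until the preceding row is two CTUs ahead (or finished),
            // so the above-right neighbour of (row, col) is available.
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] {
                return done[row - 1] >= std::min(col + 2, kCtusPerRow);
            });
        }
        process_ctu(row, col);
        {
            std::lock_guard<std::mutex> lock(m);
            ++done[row];
        }
        cv.notify_all();
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int r = 0; r < kRows; ++r) threads.emplace_back(row_thread, r);
    for (auto& t : threads) t.join();
}

With four rows of eight CTUs, the printed schedule shows the staircase pattern of the wavefront: row 1 starts only after row 0 has completed two CTUs, row 2 after row 1 has completed two CTUs, and so on.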
As an additional small processing overhead, WPP requires storing the content
of all CABAC context variables after the encoding/decoding of the second CTU
in each CTU row has been finished. Beyond that, however, WPP does not require
any extra handling of partition borders to preserve the dependencies exploited by
entropy encoding/decoding, in-picture prediction, or in-loop filtering. An example
WPP scheme for executing all HEVC decoder operations within the hybrid video
coding loop can be found in [6].
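As a simplified illustration of this context propagation (not actual reference-software code), the sequential C++ sketch below stores a snapshot of the context variables after the second CTU of each row and initializes the next row from that snapshot; CabacContexts, decode_ctu(), and default_init() are hypothetical stand-ins for the real decoder structures.

#include <cstdio>
#include <vector>

// Stand-in for the full set of CABAC context variables.
struct CabacContexts { int state = 0; };

CabacContexts default_init() { return CabacContexts{}; }

// Placeholder: a real decoder would entropy-decode the CTU and update ctx.
void decode_ctu(int row, int col, CabacContexts& ctx) {
    ++ctx.state;
    std::printf("row %d, CTU %d, ctx state %d\n", row, col, ctx.state);
}

// Each row starts from the contexts stored after the second CTU of the row above.
void decode_picture_wpp(int num_rows, int ctus_per_row) {
    std::vector<CabacContexts> saved(num_rows);
    for (int row = 0; row < num_rows; ++row) {
        CabacContexts ctx = (row == 0) ? default_init() : saved[row - 1];
        for (int col = 0; col < ctus_per_row; ++col) {
            decode_ctu(row, col, ctx);
            if (col == 1) saved[row] = ctx;   // snapshot after the second CTU
        }
    }
}

int main() { decode_picture_wpp(4, 8); }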
The header overhead associated with WPP can be kept small and may consist
only of signaling the partition entry point offsets via slice segment subsets or,
alternatively, of the reduced slice segment headers of dependent slice segments, as
discussed in Sect. 3.3.2.3 below. Together with the minor coding efficiency loss
due to the above-mentioned CABAC re-initialization by propagation of context
variables at the partition starting points, the resulting overall overhead of a WPP
bitstream is small compared to that of a non-parallel bitstream, while still enabling
a fair amount of parallelism that scales with the picture resolution.
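For illustration only, the following sketch shows how a decoder might turn the signalled entry point offsets into per-row substream byte ranges so that each WPP thread can be pointed at its starting position in the slice data. The split_substreams() helper and all byte values are hypothetical, and emulation prevention bytes are ignored; the sketch only assumes that each signalled offset gives the size of the preceding substream.

#include <cstddef>
#include <cstdio>
#include <vector>

struct Substream { std::size_t begin; std::size_t end; };

std::vector<Substream> split_substreams(std::size_t data_begin,
                                        std::size_t data_end,
                                        const std::vector<std::size_t>& entry_point_offsets) {
    std::vector<Substream> out;
    std::size_t pos = data_begin;
    for (std::size_t off : entry_point_offsets) {   // offset = size of previous substream
        out.push_back({pos, pos + off});
        pos += off;
    }
    out.push_back({pos, data_end});                 // last substream runs to the end
    return out;
}

int main() {
    // Example: slice data of 1000 bytes with three signalled entry points,
    // i.e. four CTU-row substreams (values are illustrative only).
    auto subs = split_substreams(0, 1000, {250, 240, 260});
    for (const auto& s : subs)
        std::printf("substream [%zu, %zu)\n", s.begin, s.end);
}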
In order to maintain bitstream conformance for the CTU row partitioning
approach, a constraint has been placed on the presence of slices and slice segments
within a CTU row: a slice or slice segment that does not start with the first CTU of
a CTU row must also end within that CTU row, i.e., its last CTU must belong to
the same CTU row as its first CTU.
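A minimal sketch of this conformance rule, using a hypothetical helper and illustrative CTU addresses in raster-scan order:

#include <cstdio>

// A segment either starts at the beginning of a CTU row, or it must end
// within the same CTU row in which it starts.
bool segment_obeys_wpp_row_constraint(int first_ctu_addr, int last_ctu_addr,
                                      int ctus_per_row) {
    const bool starts_at_row_begin = (first_ctu_addr % ctus_per_row) == 0;
    const bool ends_in_same_row =
        (first_ctu_addr / ctus_per_row) == (last_ctu_addr / ctus_per_row);
    return starts_at_row_begin || ends_in_same_row;
}

int main() {
    // Picture with 10 CTUs per row (illustrative numbers).
    std::printf("%d\n", segment_obeys_wpp_row_constraint(0, 25, 10));   // 1: starts a row
    std::printf("%d\n", segment_obeys_wpp_row_constraint(13, 17, 10));  // 1: stays in row 1
    std::printf("%d\n", segment_obeys_wpp_row_constraint(13, 27, 10));  // 0: crosses into row 2
}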
As already mentioned above, the scalability of wavefront parallel processing is
limited by the reduced number of independent CTUs at the beginning and at the
end of each picture. To overcome this limitation and increase the parallelization
scalability of WPP, a technique called overlapped wavefront (OWF) has been
proposed in [2, 6]. It is not part of the HEVC standard, but can be implemented
by imposing additional constraints on the encoding process. With OWF, multiple
pictures can be decoded simultaneously, so that a more constant parallelization
gain is obtained. An in-depth analysis of the parallelization tools included in
HEVC has shown that WPP combined with the OWF algorithm provides better
parallelization scalability than tiles. For quantitative details of the comparison
between WPP and tiles, as well as of the overlapped wavefront approach, the
reader is referred to the discussions and results in [6].
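As a purely conceptual sketch of the OWF idea, and not a description of the scheme in [2, 6], the condition below checks whether a CTU row of the following picture can already be started; it assumes an encoder-side restriction of the downward vertical motion range, expressed here in whole CTU rows, and ignores interpolation margins.

#include <cstdio>

// A row of the next picture may start once all reference-picture rows it can
// reach under the assumed motion restriction have been reconstructed.
bool next_picture_row_ready(int row, int ref_rows_done, int max_mv_down_rows) {
    return ref_rows_done > row + max_mv_down_rows;
}

int main() {
    // Reference picture has 5 CTU rows reconstructed; downward motion is
    // assumed to be limited to 2 CTU rows (illustrative values).
    for (int row = 0; row < 6; ++row)
        std::printf("row %d of next picture ready: %d\n",
                    row, next_picture_row_ready(row, 5, 2));
}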