Before a CTU can be processed, its above-right neighboring CTU in the preceding
CTU row must have been finished. In other words, at any point in time the thread
processing the preceding CTU row must be at least two consecutive CTUs ahead
of the thread processing the current CTU row, which results in a "wavefront" of
CTUs rolling from the top-left to the bottom-right corner of the picture, as
illustrated in Fig. 3.12. Because of these wavefront dependencies, the threads
processing the individual CTU rows cannot all start decoding simultaneously, and
consequently they cannot all finish decoding at the same time at the end of their
rows. This introduces parallelization inefficiencies, referred to as ramping
inefficiencies, which become more pronounced as the number of threads increases.
Additional pipelining issues may arise from stalls caused by an inefficient load
balancing of CTUs. For example, a slow CTU in one CTU row can stall the
processing of the subsequent CTU rows.
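To make the dependency rule concrete, the following C++ sketch simulates a WPP schedule with one thread per CTU row. The picture dimensions and the process_ctu() function are placeholders, not decoder code; each row thread simply waits until the row above is at least two CTUs ahead (or has finished) before processing its next CTU.

#include <algorithm>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kRows = 4;
constexpr int kCtusPerRow = 8;

int done[kRows] = {};                  // CTUs finished per row
std::mutex m;
std::condition_variable cv;

void process_ctu(int row, int col) {   // stand-in for real CTU decoding
    std::printf("row %d, CTU %d\n", row, col);
}

void row_thread(int row) {
    for (int col = 0; col < kCtusPerRow; ++col) {
        if (row > 0) {
            // Wait until the preceding row is two CTUs ahead (or finished),
            // so the above-right neighbour of (row, col) is available.
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] {
                return done[row - 1] >= std::min(col + 2, kCtusPerRow);
            });
        }
        process_ctu(row, col);
        {
            std::lock_guard<std::mutex> lock(m);
            ++done[row];
        }
        cv.notify_all();
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int r = 0; r < kRows; ++r) threads.emplace_back(row_thread, r);
    for (auto& t : threads) t.join();
}

With four rows of eight CTUs, the printed schedule shows the staircase pattern of the wavefront: row 1 starts only after row 0 has completed two CTUs, row 2 after row 1 has completed two CTUs, and so on.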
As an additional small processing overhead, WPP requires storing the content
of all CABAC context variables after the encoding/decoding of the second CTU
in each CTU row has been finished. Beyond that, however, WPP does not require
any extra handling of partition borders to preserve the dependencies exploited by
entropy encoding/decoding, in-picture prediction, or in-loop filtering. An example
WPP scheme for executing all HEVC decoder operations within the hybrid video
coding loop can be found in [6].
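As a simplified illustration of this context propagation (not actual reference-software code), the sequential C++ sketch below stores a snapshot of the context variables after the second CTU of each row and initializes the next row from that snapshot; CabacContexts, decode_ctu(), and default_init() are hypothetical stand-ins for the real decoder structures.

#include <cstdio>
#include <vector>

// Stand-in for the full set of CABAC context variables.
struct CabacContexts { int state = 0; };

CabacContexts default_init() { return CabacContexts{}; }

// Placeholder: a real decoder would entropy-decode the CTU and update ctx.
void decode_ctu(int row, int col, CabacContexts& ctx) {
    ++ctx.state;
    std::printf("row %d, CTU %d, ctx state %d\n", row, col, ctx.state);
}

// Each row starts from the contexts stored after the second CTU of the row above.
void decode_picture_wpp(int num_rows, int ctus_per_row) {
    std::vector<CabacContexts> saved(num_rows);
    for (int row = 0; row < num_rows; ++row) {
        CabacContexts ctx = (row == 0) ? default_init() : saved[row - 1];
        for (int col = 0; col < ctus_per_row; ++col) {
            decode_ctu(row, col, ctx);
            if (col == 1) saved[row] = ctx;   // snapshot after the second CTU
        }
    }
}

int main() { decode_picture_wpp(4, 8); }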
The header overhead associated with WPP can be kept small and may consist
only of signaling the partition entry point offsets via slice segment subsets or,
alternatively, of the reduced slice segment headers of dependent slice segments, as
discussed in Sect. 3.3.2.3 below. Together with the minor coding efficiency loss
due to the above-mentioned CABAC re-initialization by propagation of context
variables at the partition starting points, the resulting overall overhead of a WPP
bitstream is small compared to that of a non-parallel bitstream, while still enabling
a fair amount of parallelism that scales with the picture resolution.
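For illustration only, the following sketch shows how a decoder might turn the signalled entry point offsets into per-row substream byte ranges so that each WPP thread can be pointed at its starting position in the slice data. The split_substreams() helper and all byte values are hypothetical, and emulation prevention bytes are ignored; the sketch only assumes that each signalled offset gives the size of the preceding substream.

#include <cstddef>
#include <cstdio>
#include <vector>

struct Substream { std::size_t begin; std::size_t end; };

std::vector<Substream> split_substreams(std::size_t data_begin,
                                        std::size_t data_end,
                                        const std::vector<std::size_t>& entry_point_offsets) {
    std::vector<Substream> out;
    std::size_t pos = data_begin;
    for (std::size_t off : entry_point_offsets) {   // offset = size of previous substream
        out.push_back({pos, pos + off});
        pos += off;
    }
    out.push_back({pos, data_end});                 // last substream runs to the end
    return out;
}

int main() {
    // Example: slice data of 1000 bytes with three signalled entry points,
    // i.e. four CTU-row substreams (values are illustrative only).
    auto subs = split_substreams(0, 1000, {250, 240, 260});
    for (const auto& s : subs)
        std::printf("substream [%zu, %zu)\n", s.begin, s.end);
}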
In order to maintain bitstream conformance for the CTU row partitioning
approach, a constraint has been placed on the presence of slices and slice segments
within a CTU row: a slice or slice segment that does not start with the first CTU of
a CTU row must also end within that CTU row, i.e., its last CTU must belong to
the same CTU row as its first CTU.
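A minimal sketch of this conformance rule, using a hypothetical helper and illustrative CTU addresses in raster-scan order:

#include <cstdio>

// A segment either starts at the beginning of a CTU row, or it must end
// within the same CTU row in which it starts.
bool segment_obeys_wpp_row_constraint(int first_ctu_addr, int last_ctu_addr,
                                      int ctus_per_row) {
    const bool starts_at_row_begin = (first_ctu_addr % ctus_per_row) == 0;
    const bool ends_in_same_row =
        (first_ctu_addr / ctus_per_row) == (last_ctu_addr / ctus_per_row);
    return starts_at_row_begin || ends_in_same_row;
}

int main() {
    // Picture with 10 CTUs per row (illustrative numbers).
    std::printf("%d\n", segment_obeys_wpp_row_constraint(0, 25, 10));   // 1: starts a row
    std::printf("%d\n", segment_obeys_wpp_row_constraint(13, 17, 10));  // 1: stays in row 1
    std::printf("%d\n", segment_obeys_wpp_row_constraint(13, 27, 10));  // 0: crosses into row 2
}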
As already mentioned above, the scalability of wavefront parallel processing is
limited by the reduced number of independent CTUs at the beginning and at the
end of each picture. To overcome this limitation and increase the parallelization
scalability of WPP, a technique called overlapped wavefront (OWF) has been
proposed in [2, 6]. It is not part of the HEVC standard, but can be implemented
by imposing additional constraints on the encoding process. With OWF, multiple
pictures can be decoded simultaneously, so that a more constant parallelization
gain is obtained. An in-depth analysis of the parallelization tools included in
HEVC has shown that WPP combined with the OWF algorithm provides better
parallelization scalability than tiles. For quantitative details of the comparison
between WPP and tiles, as well as of the overlapped wavefront approach, the
reader is referred to the discussions and results in [6].
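As a purely conceptual sketch of the OWF idea, and not a description of the scheme in [2, 6], the condition below checks whether a CTU row of the following picture can already be started; it assumes an encoder-side restriction of the downward vertical motion range, expressed here in whole CTU rows, and ignores interpolation margins.

#include <cstdio>

// A row of the next picture may start once all reference-picture rows it can
// reach under the assumed motion restriction have been reconstructed.
bool next_picture_row_ready(int row, int ref_rows_done, int max_mv_down_rows) {
    return ref_rows_done > row + max_mv_down_rows;
}

int main() {
    // Reference picture has 5 CTU rows reconstructed; downward motion is
    // assumed to be limited to 2 CTU rows (illustrative values).
    for (int row = 0; row < 6; ++row)
        std::printf("row %d of next picture ready: %d\n",
                    row, next_picture_row_ready(row, 5, 2));
}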