Thus, the maximum speed-up factor using p processing units is typically p. The
actual speed-up is influenced by various characteristics such as synchronization
between threads, memory access patterns, and memory characteristics. Assuming a
hierarchical memory structure, an optimized implementation would therefore avoid
main memory access as much as possible and rely on cache memory access instead,
as the latter typically has much faster access times than main memory, albeit at
smaller memory sizes.
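The relation between measured run times and the bound of p can be made concrete with a minimal sketch. The function names and the timing values below are hypothetical and chosen only for illustration; they are not taken from any measurement in the text.

```python
def speedup(t_one_unit, t_p_units):
    """Speed-up factor: run time on one processing unit divided by run time on p units."""
    return t_one_unit / t_p_units


def efficiency(t_one_unit, t_p_units, p):
    """Fraction of the typical maximum speed-up p that is actually reached."""
    return speedup(t_one_unit, t_p_units) / p


# Hypothetical timings: 8.0 s on one unit, 2.5 s on p = 4 units.
s = speedup(8.0, 2.5)
e = efficiency(8.0, 2.5, 4)
print(f"speed-up {s:.2f}, efficiency {e:.2f}")
```

An efficiency below 1 reflects overheads such as thread synchronization and main memory access, which is why the speed-up in this example stays under the typical maximum of 4.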
Different parallelization techniques have been introduced to improve the
utilization of computational resources in implementations of video coding
standards. Only the most important ones are mentioned in the following:
• Picture-Level Parallelization: Picture-level parallelism consists of processing
multiple pictures at the same time provided that the temporal dependencies for
motion-compensated prediction are satisfied. Picture-level parallelism is often
sufficient for multi-core systems with a few cores. Because it is relatively
simple to implement and does not incur coding efficiency losses, it has become
the state of the art for software-based implementations of H.264 | MPEG-4
AVC. However, picture-level parallelism has a number of limitations. First, the
parallelization scalability is determined by the lengths of the motion vectors
and/or the size of the underlying group of pictures (GOP). Second, the workload
of each core may be imbalanced because the picture encoding/decoding times
can vary significantly. Finally, picture-level parallelism increases the processing
frame rate but does not improve latency.
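How the GOP structure limits picture-level parallelism can be illustrated with a small sketch. It assumes a simplified model in which a picture may start only after all of its reference pictures are completely finished (motion vector lengths are ignored), and the function name `parallel_waves` and the example GOP with its reference lists are hypothetical.

```python
from collections import defaultdict


def parallel_waves(refs):
    """Group pictures into waves; pictures in the same wave have no mutual
    dependencies and can be processed concurrently.

    refs maps each picture number to the list of pictures it references.
    The reference structure is assumed to be acyclic.
    """
    level = {}

    def depth(pic):
        # A picture without references has depth 0; otherwise it sits one
        # level below its deepest reference picture.
        if pic not in level:
            level[pic] = 1 + max((depth(r) for r in refs[pic]), default=-1)
        return level[pic]

    waves = defaultdict(list)
    for pic in refs:
        waves[depth(pic)].append(pic)
    return [sorted(waves[d]) for d in sorted(waves)]


# Hypothetical hierarchical GOP of size 8 (picture number -> reference pictures).
gop = {0: [], 8: [0], 4: [0, 8], 2: [0, 4], 6: [4, 8],
       1: [0, 2], 3: [2, 4], 5: [4, 6], 7: [6, 8]}

for wave in parallel_waves(gop):
    print(wave)
```

In this model at most four pictures are available at the same time, and only in the last wave, so additional cores beyond that cannot be kept busy by picture-level parallelism alone.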
• Slice-Level Parallelization: In HEVC and H.264 | MPEG-4 AVC, each picture
can be partitioned into slices, as described in Sect. 3.3.1. All (regular) slices
within a picture are independent from each other except for potential
dependencies regarding cross-slice border in-loop filtering. Therefore, slices can be
used for parallel processing. Slice-level parallelism, however, has a number of
disadvantages. Although slices are completely independent from each other in
terms of prediction, transform and entropy coding, in-loop filtering may be
applied across slice boundaries. For H.264 | MPEG-4 AVC it may be required
to perform deblocking of the complete picture using a single processing unit,
whereas HEVC, in principle, allows in-loop filtering to be performed on CTU
rows in parallel. Moreover, as already mentioned above, multiple slices reduce
the coding efficiency significantly due to the restrictions of in-picture prediction
and entropy coding across slice boundaries. Due to these disadvantages,
exploiting slice-level parallelism is only advisable when the number of slices per picture
is strictly limited [33].
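The structure of slice-level parallelism, independent slice processing followed by a filtering stage that has to wait for all slices, can be sketched as follows. This is a toy model under stated assumptions: `decode_slice` is a hypothetical placeholder that only returns CTU indices, slices are formed by splitting the CTUs as evenly as possible, and no real codec API is involved.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(num_ctus, num_slices):
    """Return (start, end) CTU index ranges, one per slice, as even as possible."""
    base, extra = divmod(num_ctus, num_slices)
    ranges, start = [], 0
    for i in range(num_slices):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def decode_slice(ctu_range):
    """Hypothetical placeholder for prediction, transform and entropy
    decoding of one slice; it needs no data from other slices."""
    start, end = ctu_range
    return list(range(start, end))


def decode_picture(num_ctus, num_slices):
    ranges = split_into_slices(num_ctus, num_slices)
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        # Each slice is handled by its own worker.
        parts = list(pool.map(decode_slice, ranges))
    # Cross-slice in-loop filtering would run here, after all slices are done.
    return [ctu for part in parts for ctu in part]


print(split_into_slices(10, 3))
print(decode_picture(10, 3))
```

The number of workers that can be used equals the number of slices, which is why the coding efficiency cost of adding slices directly limits how far this approach scales.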
• Block-Level Parallelization: In hardware-based implementations of H.264 |
MPEG-4 AVC, for example, a macroblock-level pipeline is very widely used.
This kind of block-level parallelization technique is based on using heterogeneous
processing cores, where one core is dedicated to entropy coding, one to
in-loop filtering, one to intra prediction and so on. In this way, macroblocks
will be processed concurrently on the different cores. Note, however, that
efficient parallel processing of macroblocks may require an elaborate scheduling