Graphics Reference
In-Depth Information
Fig. 6.11
Efficient
implementation of inverse
transform of a block with
non-zero coefficients in only
the
K
K 1D IDCT
along columns
KxK
K
low frequency
sub-block. Shaded regions
denote the regions that can
contain non-zero coefficients.
Only
K
1D IDCTs are
required along columns
NxN
N 1D IDCT
along rows
Tabl e 6. 5
Arithmetic
operation counts for HEVC
two-dimensional inverse
transforms
N
K
O
mult
O
add
4
4
48
64
8
8
352
448
8
4
132
228
16
16
2,752
3,200
16
8
1,032
1,512
16
4
420
820
32
32
21,888
23,808
32
16
8,208
10,320
32
8
3,400
5,320
32
4
1,476
3,060
In general, the number of multiplications can be reduced approximately by a
factor of (
N
/
K
)
2
for the first stage and a factor of (
N
/
K
) for the second stage.
Tab le
6.5
shows the number of arithmetic operations for various values of
N
and
K
.
Note that the majority of the arithmetic operations listed in Table
6.5
can be
efficiently implemented using SIMD instructions since the operations are matrix
multiply operations. For example, for an 8
8 inverse transform implementation,
(
6.20
) can be efficiently implemented on a 4-way SIMD processor in 4 cycles v/s
16 cycles on a processor without SIMD acceleration. Software performance using
SIMD acceleration on various Intel processor architectures for the 8
8, 16
16,
and 32
32 transform sizes are provided in [
3
,
11
].
Only the Even-Odd decomposition of the inverse transform has been described in
this subsection. However, the Even-Odd decomposition idea can be used to reduce
the complexity of the forward transform too. The article [
6
] presents analysis of
both the forward and the inverse core transform in more details. It also describes
hardware sharing by the application of property 4 of Sect.
6.2.1
(smaller transforms
being embedded in larger transforms).