Information Technology Reference
In-Depth Information
hardware) puts limits on the number of operations (more specifically, processor
instructions) performed by a kernel function. To work around this restriction,
we make use of a weight accumulation buffer (see [19]) and convert every term
of the summations in (7) into a separate pass, in which in each pass, one term
of the summation
q
∈δ
is added to the accumulation buffer. This is done for
both the numerator and denominator of (7). For
i
=1
, ...,
q
i
defined for each pass (e.g. using raster scanning), we obtain the kernel function:
|
δ
|
, with constants
f
(
i
)
U
(
i
)
1
(
p
)=
,...,
U
(
i
)
3
⎛
⎝
⎞
⎠
U
(
i
)
1
(
p
)
+
g
(
Δx,Δy
)
∈
[
−B,...,B
]
2
r
(
Δx,Δy
)
2
U
(
i
)
2
U
(
i
)
1
(
p
+
q
i
)
p
,
q
i
(8)
+
g
(
Δx,Δy
)
∈
[
−B,...,B
]
2
r
(
Δx,Δy
)
2
U
(
i
)
3
p
,
q
i
U
(
i
)
2
U
(
i
)
3
where
is an accumulation buffer for the denoised image, and where
is
U
(1)
2
(
p
)=
U
(1)
3
a weight accumulation buffer (initially,
(
p
)=
0
). Next, one last
pass is required, to compute the final output image:
T
U
(
I
)
2
(
p
)
f
(
I
)
U
(
I
)
1
(
p
)=
(
p
)
00
,
(9)
,...,
U
(
I
)
3
U
(
I
)
3
with
I
=
|
δ
|
+1. The number of operations per pass is now multiplied by a
factor 1
/
, but is still very high. To further reduce this number of opera-
tions, we could apply a similar split-up technique and convert the summation
|
δ
|
(
Δx,Δy
)
∈
[
−B,...,B
]
2
into several passes. We note that, even though this way
we would obtain a working algorithm for most available GPUs, the number
of passes
I
=
|
|
(2
B
+1)
2
+1
+1becomes very high. For example, for a
31
×
31
×
4-search window and
B
=4,weobtain
I
= 311365 passes. If for each
video frame, a single pass of the algorithm would take 0
.
1 msec. on a GPU, the
complete algorithm would still require approx. 31 sec. for processing one sin-
gle frame of a video sequence, which is similar to the computation time of our
CPU version mentioned in Section 1. Hence, further algorithmic accelerations
are required.
δ
2.3 Actual Implementation Using Algorithmic Accelerations
In [4], we pointed out that the term
(
Δx,Δy
)
∈
[
−B,...,B
]
2
r
(
Δx,Δy
)
2
can be
interpreted as a convolution operator with a filter kernel with square support.
Consequently the Euclidean distance between two patches can eciently be com-
puted using a moving average filter, and the algorithmic complexity is reduced
with roughly a factor (2
B
+1)
2
/
2. Unfortunately, converting a moving average
filter directly into a GPU program as in (4) is not feasible in a small number of
passes. Instead, we exploit the separability of the filter kernel and we implement
p
,
q
Search WWH ::
Custom Search