Information Technology Reference
In-Depth Information
hardware) puts limits on the number of operations (more specifically, processor
instructions) performed by a kernel function. To work around this restriction,
we make use of a weight accumulation buffer (see [19]) and convert every term
of the summations in (7) into a separate pass, in which in each pass, one term
of the summation q ∈δ is added to the accumulation buffer. This is done for
both the numerator and denominator of (7). For i =1 , ...,
q i
defined for each pass (e.g. using raster scanning), we obtain the kernel function:
|
δ
|
, with constants
f ( i )
U ( i )
1
( p )=
,..., U ( i )
3
U ( i )
1
( p )
+ g ( Δx,Δy ) [ −B,...,B ] 2 r ( Δx,Δy )
2
U ( i )
2
U ( i )
1
( p + q i )
p , q i
(8)
+ g ( Δx,Δy ) [ −B,...,B ] 2 r ( Δx,Δy )
2
U ( i )
3
p , q i
U ( i )
2
U ( i )
3
where
is an accumulation buffer for the denoised image, and where
is
U (1)
2
( p )= U (1)
3
a weight accumulation buffer (initially,
( p )= 0
). Next, one last
pass is required, to compute the final output image:
T
U ( I )
2
( p )
f ( I )
U ( I )
1
( p )=
( p ) 00
,
(9)
,..., U ( I )
3
U ( I )
3
with I = |
δ
| +1. The number of operations per pass is now multiplied by a
factor 1 /
, but is still very high. To further reduce this number of opera-
tions, we could apply a similar split-up technique and convert the summation
|
δ
|
( Δx,Δy ) [ −B,...,B ] 2 into several passes. We note that, even though this way
we would obtain a working algorithm for most available GPUs, the number
of passes I = |
| (2 B +1) 2 +1 +1becomes very high. For example, for a
31 × 31 × 4-search window and B =4,weobtain I = 311365 passes. If for each
video frame, a single pass of the algorithm would take 0 . 1 msec. on a GPU, the
complete algorithm would still require approx. 31 sec. for processing one sin-
gle frame of a video sequence, which is similar to the computation time of our
CPU version mentioned in Section 1. Hence, further algorithmic accelerations
are required.
δ
2.3 Actual Implementation Using Algorithmic Accelerations
In [4], we pointed out that the term ( Δx,Δy ) [ −B,...,B ] 2 r ( Δx,Δy )
2 can be
interpreted as a convolution operator with a filter kernel with square support.
Consequently the Euclidean distance between two patches can eciently be com-
puted using a moving average filter, and the algorithmic complexity is reduced
with roughly a factor (2 B +1) 2 / 2. Unfortunately, converting a moving average
filter directly into a GPU program as in (4) is not feasible in a small number of
passes. Instead, we exploit the separability of the filter kernel and we implement
p , q
 
Search WWH ::




Custom Search