A GPU-Accelerated Real-Time NLMeans Algorithm for Denoising Color Video Sequences - Advanced Concepts for Intelligent Vision Systems

Information Technology Reference

In-Depth Information

hardware) puts limits on the number of operations (more specifically, processor

instructions) performed by a kernel function. To work around this restriction,

we make use of a weight accumulation buffer (see [19]) and convert every term

of the summations in (7) into a separate pass, in which in each pass, one term

of the summation q ∈δ is added to the accumulation buffer. This is done for

both the numerator and denominator of (7). For i =1 , ...,

q i

defined for each pass (e.g. using raster scanning), we obtain the kernel function:

, with constants

f ( i )

U ( i )

( p )=

,..., U ( i )

⎛

⎝

⎞

⎠

U ( i )

( p )

+ g ( Δx,Δy ) ∈ [ −B,...,B ] 2 r ( Δx,Δy )

U ( i )

( p + q i )

p , q i

(8)

+ g ( Δx,Δy ) ∈ [ −B,...,B ] 2 r ( Δx,Δy )

U ( i )

p , q i

U ( i )

where

is an accumulation buffer for the denoised image, and where

U (1)

( p )= U (1)

a weight accumulation buffer (initially,

( p )= 0

). Next, one last

pass is required, to compute the final output image:

U ( I )

( p )

f ( I )

U ( I )

( p )=

( p ) 00

(9)

,..., U ( I )

U ( I )

with I = |

| +1. The number of operations per pass is now multiplied by a

factor 1 /

, but is still very high. To further reduce this number of opera-

tions, we could apply a similar split-up technique and convert the summation

( Δx,Δy ) ∈ [ −B,...,B ] 2 into several passes. We note that, even though this way

we would obtain a working algorithm for most available GPUs, the number

of passes I = |

| (2 B +1) 2 +1 +1becomes very high. For example, for a

31 × 31 × 4-search window and B =4,weobtain I = 311365 passes. If for each

video frame, a single pass of the algorithm would take 0 . 1 msec. on a GPU, the

complete algorithm would still require approx. 31 sec. for processing one sin-

gle frame of a video sequence, which is similar to the computation time of our

CPU version mentioned in Section 1. Hence, further algorithmic accelerations

are required.

2.3 Actual Implementation Using Algorithmic Accelerations

In [4], we pointed out that the term ( Δx,Δy ) ∈ [ −B,...,B ] 2 r ( Δx,Δy )

2 can be

interpreted as a convolution operator with a filter kernel with square support.

Consequently the Euclidean distance between two patches can eciently be com-

puted using a moving average filter, and the algorithmic complexity is reduced

with roughly a factor (2 B +1) 2 / 2. Unfortunately, converting a moving average

filter directly into a GPU program as in (4) is not feasible in a small number of

passes. Instead, we exploit the separability of the filter kernel and we implement

p , q

Advanced Concepts for Intelligent Vision Systems

Search WWH ::

Custom Search

Home