2.5 Discussion
To optimize the computational performance of a GPU program, minimizing the
number of passes and performing more operations in each kernel function is more
beneficial than optimizing the individual kernel functions themselves, especially
when the kernel functions are relatively simple (as in our algorithm in Section 2.3).
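To illustrate this point, consider the following small sketch (our own illustration, not the paper's code; plain Python lists stand in for GPU textures). Each rendering pass reads and writes a full image, so fusing two simple per-pixel operations into one kernel halves the number of full-image traversals:

```python
# Illustrative sketch only: each GPU pass reads and writes a full texture,
# so folding two simple per-pixel operations into one kernel halves the
# memory traffic. Plain Python lists stand in for textures here.

def two_passes(pixels):
    tmp = [p * 0.5 for p in pixels]         # pass 1: scale (full read/write)
    return [t + 1.0 for t in tmp]           # pass 2: offset (full read/write)

def fused_pass(pixels):
    return [p * 0.5 + 1.0 for p in pixels]  # one pass: scale and offset

# Same result, half the passes (and half the per-pass CPU interactions).
assert two_passes([2.0, 4.0]) == fused_pass([2.0, 4.0]) == [2.0, 3.0]
```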
This is due to GPU memory caching behavior and also because every pass typi-
cally requires interaction with the CPU (for example, the computation time of an
individual pass can be affected by the process CPU scheduling granularity). To as-
sess the computational performance improvement, a possible solution would be to
use theoretical models to predict the performance. Unfortunately, such theoretical
models depend heavily on the underlying GPU architecture: because of the parallel
processing, the computational performance cannot simply be expressed as a function
of the total number of floating-point operations. To obtain a rough idea of the
computational performance, we use the actual number of passes required by our
algorithm. For example, when comparing our algorithmic accelerations from
Section 2.3 to the naive NLMeans algorithm from Section 2.2, we see that the
number of passes is reduced by a factor of:
(|δ| ((2B + 1)² + 1) + 1) / (4 ((|δ| + 1)/2 + 1)) ≈ (2B + 1)² / 2.
For patches of size 9 × 9, the accelerated NLMeans GPU algorithm requires
approximately 40 times fewer processing passes.
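As a quick sanity check, the reduction factor can be evaluated numerically (a sketch of our own; the function names and the concrete 15 × 15 × 3 search window are illustrative assumptions, not values from the text). Patches of size 9 × 9 correspond to B = 4:

```python
# Sketch: evaluate the pass-reduction factor
#   (|δ| ((2B+1)² + 1) + 1) / (4 ((|δ| + 1)/2 + 1)),
# where B is the half patch size and n_delta = |δ| is the search-window size.

def pass_reduction_factor(B: int, n_delta: int) -> float:
    passes_naive = n_delta * ((2 * B + 1) ** 2 + 1) + 1
    passes_accelerated = 4 * ((n_delta + 1) // 2 + 1)
    return passes_naive / passes_accelerated

# 9 × 9 patches give B = 4; assume a 15 × 15 × 3 search window, so |δ| = 675.
print(pass_reduction_factor(4, 675))  # roughly 40
```

For large |δ| the factor approaches (2B + 1)²/2 = 40.5 when B = 4, consistent with the roughly 40-fold reduction stated above.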
Another point of interest is the streaming behavior of the algorithm: for real-
time applications, the algorithm must process video frames as soon as they
become available. In our algorithm, this can be controlled completely by
adjusting the size of the search window. Suppose we choose:
δ = [−A, ..., A] × [−A, ..., A] × [−D past, ..., D future]
with A, D past, D future ≥ 0 constants. A determines the size of the spatial
search window; D past and D future are respectively the number of past and future
frames that the filter uses for denoising the current frame. For a causal
implementation of the filter, a delay of D future frames is required. Of course,
D future can
even be zero, if desired. However, the main disadvantage of a zero delay is that
the translation technique from Section 2.3 cannot be used in the temporal di-
rection, because the translation technique in fact requires the updating of future
frames in the accumulation buffer. Nevertheless, using a small positive D future ,
a trade-off can be made between the filter delay and the algorithmic acceleration
achieved by exploiting the weight symmetry. The number of video frames in
GPU memory is at most 4( D past + D future +1).
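The construction of the search window and the frame-buffer bound can be sketched as follows (our own illustration; the function names and the concrete values of A, D past, and D future are assumptions):

```python
from itertools import product

# Sketch: enumerate δ = [-A, ..., A] × [-A, ..., A] × [-D_past, ..., D_future]
# and the bound of 4 (D_past + D_future + 1) video frames kept in GPU memory.

def search_window(A, d_past, d_future):
    return list(product(range(-A, A + 1),               # horizontal offsets
                        range(-A, A + 1),               # vertical offsets
                        range(-d_past, d_future + 1)))  # temporal offsets

def max_frames_in_memory(d_past, d_future):
    return 4 * (d_past + d_future + 1)

delta = search_window(7, 1, 1)      # 15 × 15 spatially, 3 frames temporally
assert len(delta) == 15 * 15 * 3    # |δ| = 675
assert max_frames_in_memory(1, 1) == 12
```

Setting d_future = 0 in this sketch yields a zero-delay, causal filter, at the cost of the temporal weight symmetry discussed above.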
3 Experimental Results
To demonstrate the processing time improvement of our GPU algorithm with
the proposed accelerations, we apply our technique to a color video sequence of
resolution 720 × 480 (a common resolution for DVD-video). The video sequence
is corrupted with artificially added stationary white Gaussian noise with
 