2.5 Discussion
To optimize the computational performance of a GPU program, minimizing the
number of passes and performing more operations in each kernel function is more
beneficial than optimizing the individual kernel functions themselves, especially
when the kernel functions are relatively simple (as in our algorithm in Section 2.3).
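To illustrate this point, consider the following small sketch (our own illustration, not the paper's code; plain Python lists stand in for GPU textures). Each rendering pass reads and writes a full image, so fusing two simple per-pixel operations into one kernel halves the number of full-image traversals:

```python
# Illustrative sketch only: each GPU pass reads and writes a full texture,
# so folding two simple per-pixel operations into one kernel halves the
# memory traffic. Plain Python lists stand in for textures here.

def two_passes(pixels):
    tmp = [p * 0.5 for p in pixels]         # pass 1: scale (full read/write)
    return [t + 1.0 for t in tmp]           # pass 2: offset (full read/write)

def fused_pass(pixels):
    return [p * 0.5 + 1.0 for p in pixels]  # one pass: scale and offset

# Same result, half the passes (and half the per-pass CPU interactions).
assert two_passes([2.0, 4.0]) == fused_pass([2.0, 4.0]) == [2.0, 3.0]
```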
This is due to GPU memory caching behavior and also because every pass typi-
cally requires interaction with the CPU (for example, the computation time of an
individual pass can be affected by the process CPU scheduling granularity). To as-
sess the computational performance improvement, a possible solution would be to
use theoretical models to predict the performance. Unfortunately, such theoretical
models depend heavily on the underlying GPU architecture: because of the parallel
processing, the computational performance cannot simply be expressed as a function
of the total number of floating-point operations. To obtain a rough idea of the
computational performance, we use the actual number of passes required by our
algorithm. For example, when comparing our algorithmic accelerations from
Section 2.3 to the naive NLMeans algorithm from Section 2.2, we see that the
number of passes is reduced by a factor of:
(|δ| ((2B + 1)² + 1) + 1) / (4 ((|δ| + 1)/2 + 1)) ≈ (2B + 1)² / 2.
For patches of size 9 × 9, the accelerated NLMeans GPU algorithm requires
approximately 40 times fewer processing passes.
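As a quick sanity check, the reduction factor can be evaluated numerically (a sketch of our own; the function names and the concrete 15 × 15 × 3 search window are illustrative assumptions, not values from the text). Patches of size 9 × 9 correspond to B = 4:

```python
# Sketch: evaluate the pass-reduction factor
#   (|δ| ((2B+1)² + 1) + 1) / (4 ((|δ| + 1)/2 + 1)),
# where B is the half patch size and n_delta = |δ| is the search-window size.

def pass_reduction_factor(B: int, n_delta: int) -> float:
    passes_naive = n_delta * ((2 * B + 1) ** 2 + 1) + 1
    passes_accelerated = 4 * ((n_delta + 1) // 2 + 1)
    return passes_naive / passes_accelerated

# 9 × 9 patches give B = 4; assume a 15 × 15 × 3 search window, so |δ| = 675.
print(pass_reduction_factor(4, 675))  # roughly 40
```

For large |δ| the factor approaches (2B + 1)²/2 = 40.5 when B = 4, consistent with the roughly 40-fold reduction stated above.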
Another point of interest is the streaming behavior of the algorithm: for real-
time applications, the algorithm must process video frames as soon as they
become available. In our algorithm, this can be controlled completely by
adjusting the size of the search window. Suppose we choose:
δ = [−A, ..., A] × [−A, ..., A] × [−D past, ..., D future]
with A, D past, D future ≥ 0 constants. A determines the size of the spatial
search window; D past and D future are respectively the number of past and future
frames that the filter uses for denoising the current frame. For a causal
implementation of the filter, a delay of D future frames is required. Of course,
D future can
even be zero, if desired. However, the main disadvantage of a zero delay is that
the translation technique from Section 2.3 cannot be used in the temporal di-
rection, because the translation technique in fact requires the updating of future
frames in the accumulation buffer. Nevertheless, using a small positive D future ,
a trade-off can be made between the filter delay and the algorithmic acceleration
achieved by exploiting the weight symmetry. The number of video frames in
GPU memory is at most 4( D past + D future +1).
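The construction of the search window and the frame-buffer bound can be sketched as follows (our own illustration; the function names and the concrete values of A, D past, and D future are assumptions):

```python
from itertools import product

# Sketch: enumerate δ = [-A, ..., A] × [-A, ..., A] × [-D_past, ..., D_future]
# and the bound of 4 (D_past + D_future + 1) video frames kept in GPU memory.

def search_window(A, d_past, d_future):
    return list(product(range(-A, A + 1),               # horizontal offsets
                        range(-A, A + 1),               # vertical offsets
                        range(-d_past, d_future + 1)))  # temporal offsets

def max_frames_in_memory(d_past, d_future):
    return 4 * (d_past + d_future + 1)

delta = search_window(7, 1, 1)      # 15 × 15 spatially, 3 frames temporally
assert len(delta) == 15 * 15 * 3    # |δ| = 675
assert max_frames_in_memory(1, 1) == 12
```

Setting d_future = 0 in this sketch yields a zero-delay, causal filter, at the cost of the temporal weight symmetry discussed above.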
3 Experimental Results
To demonstrate the processing time improvement of our GPU algorithm with
the proposed accelerations, we apply our technique to a color video sequence of
resolution 720 × 480 (a common resolution for DVD-video). The video sequence
is corrupted with artificially added stationary white Gaussian noise with
 