Non-Linear Filtering of Images (Image Processing) Part 3

Generating Noise with the Standard C Library

This project includes a class named AddNoise, which corrupts an image with salt and pepper noise. As we saw in the MATLAB code, generating image noise depends on pseudo-random number generation routines. While the standard C library does not have all of the helpful utility functions that MATLAB has for generating random sequences, with a bit of work the standard C library function rand can be used to generate the noise distributions we need in this application. These distributions are used within the static method SaltAndPepper, which in turn is called by the GUI code (see MedianFilterDlg. cpp) to create a corrupted image. Listing 4-11 is the C++ implementation file for the AddNoise class.

Listing 4-11: AddNoise. cdd

This method for generating shot noise first relies on a uniform distribution, one where the spread of the numbers is evenly distributed throughout a given interval. This distribution determines which pixels are to be corrupted. Prior to generating random numbers, one should seed the C random number generator via a call to srand. Failing to seed the random number generator results in always generating the same pseudo-random number sequence, which can be useful in its own right for debugging purposes. By repeatedly calling rand and dividing its output by the macro RANDOM AX (the maximum value that can be returned from rand), a uniform distribution within the interval [0.0,1.0] is generated – such a distribution is generated within the UniformDist method and stored within an STL vector.

The next step is to corrupt the condemned pixels by setting their intensity level to either 0 or 255, depending on the sign of a number from a normal distribution with zero mean, variance of one, and standard deviation of one, a classic distribution exhibiting the familiar symmetric "bell" shape17. There are numerous algorithms for converting a uniform distribution to a normal distribution. The polar method of Box and Muller, is one of the more common and is covered in [18]. RandN uses a simpler approximation that exploits the central limit theorem, which states that the means of random samples taken from any distribution tend to be normally distributed. Thus the result of UniformDist can be called k times and accumulated. Subtracting the sample mean (k/2 in this case because the distribution lies within the interval [0.0,1.0]) and then dividing by the square root of the sample variance:

produces a normal distribution with a variance of one and a standard deviation of one. For the case of A^=12, all that is required is to sum up 12 uniform samples and subtract 6. The handy STL function accumulate is used to implement this technique.

Profiling Code in Visual Studio .NET 2003

The purpose of this Visual Studio C/C++ application is two-fold. One, it serves as an interesting example of an image-processing algorithm "test bench". An application like this can be used as a boilerplate for jumpstarting future development. But more importantly, desktop C/C++ code is inherently closer to the eventual DSP implementation than MATLAB code can ever be, and thus because debugging is that much easier on the desktop, it sometimes makes sense to implement certain algorithm tweaks and optimizations at this level. That is not to say that MATLAB or other prototyping/simulation tools do not have their place – they most certainly do. But as we shall see shortly, we are going to be optimizing the median filter algorithm in ways that really are not amenable to a MATLAB-based implementation. However, to measure the efficiency of C/C++ algorithms we need a means of accurately profiling the code in Visual Studio.

There are a number of extremely well polished tools on the market that do a fantastic job of profiling code running on a Wintel machine. Two of the more popular are IBM Rational Quantify and the Intel® VTune™

Performance Analyzer. These tools are very sophisticated and plug into the Visual Studio .NET IDE. The programs in this topic use a much simpler profiling mechanism, similar to the usage of the clock function seen in previous DSP projects. The Windows high-resolution performance counter API is a useful set of functions that can be used for straightforward code profiling19.

Essentially this coding technique works just as it did in the DSP case. QueryPerformanceCounter obtains the current value of the counter, and the difference between two counter measurements, divided by the frequency of resolution, yields the time taken to execute the code bracketed by the two measurement points. This profiling method is employed in the MedianFilter class, which will be discussed in greater detail in the subsequent section. The only caveat is that these counter functions use a LARGE_INTEGER data type to store the value of the counter. A LARGE_INTEGER is a 64-bit quantity defined as a C union in the Microsoft header files. To access the entire quantity as a single entity for the purposes of performing arithmetic, testing for equality, or casting, the QuadPart member of the union should be used, as shown in the following code snippet:

Finally, whenever profiling code built using Microsoft Visual Studio, make sure to measure the "Release" configuration build, and not the "Debug" configuration build. The more recent versions of Microsoft’s compiler are extremely aggressive, and as is the case with the TI compiler, optimizations are by default completely disabled in the Debug configuration. The Build|Configuration Manager menu selection in Visual Studio allows one to change the active build configuration.

Various C/C++ Implementations of the Median Filter

The MedianFilter class, portions of which are given in Listing 4-12, does the heavy lifting in this application. This class performs median filtering using several different implementations, which are selected from the main GUI dialog via a dropdown list box:

1. Generalized KxK mask, with the neighborhood sorted via the standard C library function qsort, and median pixel value extracted from this sorted array (MedianFilter: : processImageQsort).

2. Same as (1), but use the STL function sort instead of qsort (MedianFilter: :processImageSTLsort).

3. Generalized KxK mask, using the Intel IPP function ippiFilterMedian_8u_ClR (MedianFilter: : process ImageIpp).

4. Fixed 3×3 mask, using an optimized median finding algorithm specifically targeted for arrays of 9 elements (MedianFilter: : process Image3x3).

5. Same as (4), but for the 5×5 case (MedianFilter: : process Image5x5).

The first two implementations are what could be deemed naive implementations in the sense that while they work, they are wholly unoptimized in every sense of the word. A simple means of computing the median pixel intensity in a neighborhood is to first sort the neighborhood, and then pick the central value from this sorted list. This scheme comes directly from the definition of the median (see Figure 4-13).

Listing 4-12: nortions of MedianFilter. cdd

The qsort function, which has been part and parcel of the ANSI C library from time immemorial, is an implementation of the "quicksort" sorting algorithm. Quicksort is a well-known algorithm that uses a divide-and-conquer methodology to sort a list of elements stored in an array20. The problem we run into when employing qsort to perform median filtering is that it is overkill for this purpose. An ANSI C qsort function is highly parameterized, so that it can be used to sort arrays of any type of object. This generality is realized through the use of a function pointer, which points to a comparison function that given two objects, returns an integer value based on the ordering between those two elements. In the MedianFilter: :processImageQsort method, each invocation of qsort is passed a pointer to the static class method compare. When passing a function pointer to a C routine that points to a C++ class method, the function pointer must point to a static class method. The reason for this is with normal class member functions, there is the implicit this pointer, where this refers to the current object. Passing in a C++ class member function pointer to qsort will result in a compilation error.

The incredible overhead associated with using the qsort function is apparent when considering the profiling results. Median filtering a 256×256 image with a 3×3 mask using qsort takes on average about 64,000 ps on a 2 GHz Pentium 4 workstation. Considering that method 3, which uses the Intel IPP median filtering function, takes on average a mere 431 ps (a better than two orders of magnitude performance increase) is enough evidence that qsort is most definitely not the way to go to implement efficient median filtering suitable for use on a DSP.

One red flag is the comparison function pointer. This means that a pointer is dereferenced and a function call is made for each comparison during the sort. A complexity analysis of Quicksort shows that on average, about nlog(n) comparisons are required, and in the worst case, n(n-l)/2 comparisons are needed, where n is the total number of pixels in a given neighborhood. Keeping in mind that the neighborhood is shifted about the image for every pixel location, this entails substantial overhead, as variables need to be pushed and popped to and from the stack with each invocation of the comparison function. Moreover, the most common implementation of the quicksort algorithm is recursive, due to its inherent simplicity and elegance – that is, the function calls itself. This function recursion in turn causes even more overhead in the algorithm, for the same reasons as the usage of the comparison function pointer in qsort (excessive pushing and popping). In general, due to this overhead a good rule of thumb is to be extremely wary of implementing any recursive algorithm on an embedded real-time system.

Method 2 slightly improves the situation. The implementation for this variant of the algorithm is virtually identical to the first method, except that the call to qsort is replaced with a call to the C++ STL function sort. The beauty of the STL is that it achieves generality with respect to data types through judicious use of C++ templates. The C qsort function achieves equivalent functionality through the use of function pointers in a type-unsafe manner (because arrays are treated as void pointers). Not only is the C++ code more aesthetically pleasing, this is also an example where a C++ language feature enables a speed boost. Because the STL consists of small template functions defined in header files, it is highly likely that the compiler will inline the sort function call directly into the processImageSTLsort method. In addition, now that there is no function pointer, the main loop that constitutes the sorting algorithm need not invoke a function to compare two elements. Without going into the specifics of the rules behind the STL, suffice it to say that as long as the objects in your container (i.e. a C array or STL vector) understand the < operator, as all plain old data (POD) types like ints, doubles, and floats do, then sort will work21. Finally, the C++ library that comes with Microsoft Visual Studio .NET 2003 uses a non-recursive insertion sort algorithm for small arrays of less than 32 elements, instead of a recursive Quicksort implementation. Yet the performance boost remains quite modest as compared to method 1 – filtering a 256×256 image with a 3×3 kernel still takes on average 55,000 ps.

The real bang for the buck occurs when taking a step back and considering the algorithm itself, instead of focusing on the code. There are really two levels of code optimization: optimization that focuses on low-level details, oftentimes processor or platform specific, and more high-level algorithmic issues. It is the latter where the most significant performance increases are usually attained, and these are best solved first on the desktop, because by the time one is developing on the embedded platform there are so many other issues to deal with. The essence of the performance problem here is that both qsort and sort are shackled by the fact that they are generic algorithms and perform a complete sort of the neighborhood, when in fact to compute a median value from a list of elements all that is needed is a partial sort of the data20. Just as implementing a minimum or maximum filter through picking either the first or last element from a sorted list obtained via some sorting routine is complete overkill (why not simply iterate through the neighborhood and cache the minimum/maximum intensities?), so is using a bona-fide sorting routine to implement a median filter for a small, fixed number of pixels. Partial sorting in this context means that the central value in the array is guaranteed to contain the median, but the ordering of the other elements in the list is not guaranteed. The partial sort procedure can then be hardwired per kernel size to achieve a blazingly fast implementation, because there will be no looping (and recursion) involved.

Figure 4-20 is a depiction of the exchange network for performing a complete sort of a list of three elements22, and thus the second element in the list is the median. Given a 3×3 neighborhood then, an initial thought might be to apply the algorithm shown in Figure 4-20 first across the three rows, and then down the middle (second) column. Unfortunately such a procedure only guarantees a pixel intensity of rank 4, 5 (the median), or 6 in the central array index, because the median filter is not separable.

It turns out that the minimum exchange network that partially sorts a nine-element list, and guarantees that the central element in that list is the actual median value, requires a total 19 comparisons23’24’25. This complete network is shown in Figure 4-21, and is optimal in the sense that without making any assumptions on the input, it is impossible to extract the median in a fewer number of comparison operations. The 3×3 minimum exchange network works by first performing "row" sorts, followed by "column" sorts, and concluding with "diagonal" sorts. This implementation uses the pix_swap and pix_sort macros defined just above the processImage3x3 method.

This custom, hardwired 3×3 median filter implementation is substantially faster – filtering a 256×256 image on the same test machine now takes on average about 9,750 ps. While this is still not as fast as the Intel IPP method, from an algorithmic perspective it is optimal – it is impossible to compute the median using a fewer number of comparison operations. The partial sorting optimization can be extended to other filter sizes. A minimum exchange network for a 5×5 median filter requires 99 comparisons, and is implemented in MedianFilter: :processImage5x5. Table 4-2 summarizes the profiling results for the median filtering implementations we implemented in this section.

Table 4-3. Performance results for various median filter implementations. Timing measurements taken using Visual Studio .NET 2003 release build (default options), using a 256×256 image, on a 2 GHz Pentium 4 workstation. The quoted times are averages of five runs.

Implementation	MedianFilter class method	Time (\|xs)
3×3 with qsort	processImageQsort	64,000
3×3 with STL sort	processImageSTLsort	55,000
3×3 with ippiFilterMedian_8u_ClR	processlmagelpp	431
3×3 with 9-element minimum exchange	processImage3x3	9,750
network

Figure 4-20. Exchange network for performing a complete sort of a list of 3 elements. The thin shaded boxes represent single-element sorters.

Figure 4-21. Minimum exchange network for computing the median of 9 elements. The C statements on the right come directly from the code in MedianFilter: : processImage3x3 that implements this procedure.

Non-Linear Filtering of Images (Image Processing) Part 3

Generating Noise with the Standard C Library

Profiling Code in Visual Studio .NET 2003

Various C/C++ Implementations of the Median Filter

Related Links

:: Search WWH ::