Linear Filtering of Images on The TI C62XX/C67XX (Image Processing) Part 2

Low-Pass Filtering Using DSPLIB (blur_dsplib)

Through judicious use of the DSPLIB function DSP_fir_gen, which performs one-dimensional filtering of signals [7], one can implement a two-dimensional filtering algorithm. This particular implementation is not as general as filter_imglib: the variable H in this case is a 1D array of length NH, and as a consequence the only meaningful filters that can be represented in this fashion are low-pass averaging filters (where every filter coefficient is the same). This restriction simplifies the algorithm and opens the door to optimization, which makes it worth considering. A more general variant of the 2D filtering algorithm, where H is a 2D array and thus can represent any convolution, a la filter_imglib, is developed in the subsequent sections.

Figure 4-11 illustrates how this program low-pass filters an image with a 3×3 kernel. Because the input image is stored as a flattened 2D array and every row is run through the same FIR filter, all of the rows can be filtered with a single call to DSP_fir_gen. There remain a few caveats:



Figure 4-10. Visualizing the results of the filter_imglib program by rendering out_img as a 256×256 RGB image.

1. We must discard some of the output from the FIR filter, because during the transition from row r to row r+1 (the last pixel in row r and the first pixel in row r+1), DSP_fir_gen assumes contiguous samples, and consequently this portion of the output has meaningless values.

2. DSP_fir_gen requires the input array to be slightly larger than the image buffer, to account for the last few samples in the input array. Specifically, we must define in_img to be of length (# rows)(# columns) + NH - 1.

3. DSP_fir_gen expects the filter coefficient vector to be in reverse order, but since the averaging filter is symmetric (in fact it is constant) that requirement does not matter here.
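
The row-then-column scheme and the three caveats above can be sketched in portable C. This is a minimal sketch under stated assumptions, not the contents of blur_dsplib.c: fir_q15_ref is a hypothetical stand-in for a generic Q15 FIR (it is not the TI routine), blur_rows_then_columns is a hypothetical name, and each output pixel is anchored at the top-left corner of its window, which may differ from the indexing used in the original listing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical portable stand-in for a 1D Q15 FIR (NOT the DSPLIB routine):
   r[i] = (sum over j < nh of x[i + j] * h[j]) >> 15.
   Reads x[0 .. nr + nh - 2], which is why the input needs nh - 1 extra samples. */
static void fir_q15_ref(const short *x, const short *h, short *r, int nh, int nr)
{
    for (int i = 0; i < nr; i++) {
        long acc = 0;
        for (int j = 0; j < nh; j++)
            acc += (long)x[i + j] * h[j];
        r[i] = (short)(acc >> 15);
    }
}

/* Averages nh x nh windows of a rows x cols image stored flattened in in_img.
   in_img must hold rows * cols + nh - 1 samples (caveat 2); fir_out and out
   hold rows * cols elements. Only the (rows - nh + 1) x (cols - nh + 1) valid
   region of out is written; out[r * cols + c] covers the window whose
   top-left corner is (r, c). */
static void blur_rows_then_columns(const short *in_img, short *fir_out,
                                   unsigned char *out, int rows, int cols,
                                   const short *h, int nh)
{
    assert(rows >= nh && cols >= nh);

    /* One FIR call filters every row of the flattened image. */
    fir_q15_ref(in_img, h, fir_out, nh, rows * cols);

    for (int r = 0; r + nh <= rows; r++) {
        /* Stop at cols - nh: later samples straddle two rows (caveat 1). */
        for (int c = 0; c + nh <= cols; c++) {
            int sum = 0;
            for (int k = 0; k < nh; k++)      /* sum nh vertically adjacent row results */
                sum += fir_out[(r + k) * cols + c];
            out[r * cols + c] = (unsigned char)sum;
        }
    }
}
```

With a 3×3 kernel, every coefficient would be roughly 1/9 in Q15 (3640 after truncation). Because each row result is truncated by the shift before the column sum, a constant image of value 90 comes out as 87 in this sketch rather than 90; an implementation that cared about that bias would need to round or carry extra precision.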

Listing 4-2 shows the contents of blur_dsplib.c, sans main, which remains identical to the version in filter_imglib.c. The memory inefficiencies are still present, but this program now takes on average 2,199,517 cycles to low-pass filter a 256×256 image with a 5×5 averaging kernel, or 4.7% of the clock cycles required for filter_imglib. This is not a truly fair comparison, as this program is only capable of a small (however important) subset of spatial filters. It nevertheless represents a substantial performance boost, and serves to illustrate just how important it is to use optimized code, especially when such functions are readily available.


Figure 4-11. 2D filtering using the 1D DSPLIB function DSP_fir_gen. In this diagrammatic representation of how blur_dsplib works for a 3×3 kernel, the steps to produce the pixel at out_img[1][1] are shown. All of the pixels in in_img are passed through the FIR filter in one fell swoop, producing a series of contiguous filtered rows in fir_output. The second elements of the first three filtered rows are then summed to produce the final 2D filtered pixel.

 

Listing 4-2: portions of blur_dsplib.c


Low-Pass Filtering with DSPLIB and Paging (blur_dsplib_paging)

It has been repeatedly alluded to that failing to consider the memory footprint of an image processing algorithm implementation is a major detriment to performance. Indeed, the external memory interface (EMIF) on the C6701 is slow: it takes between 15 and 17 cycles to access a pixel stored in external RAM, versus a single cycle for a pixel in on-chip RAM [8]. Such latencies result in the DSP stalling while it waits for data to arrive via the EMIF. The issue of data residing in external off-chip RAM is even more pressing here than for the point processing operations of the next topic, because the interior pixels of the input image need to be accessed multiple times. Consider an interior image pixel and a 5×5 kernel. This pixel will be accessed 25 times, for a worst-case access penalty of (25)(17) = 425 cycles, versus just 17 + 25 = 42 cycles if the pixel is first copied from external RAM and then accessed repeatedly from on-chip RAM.

Because image filtering in general is a well-structured algorithm, this spatial locality (once the algorithm has accessed the pixel NH² times, it never needs it again) can be exploited using a memory optimization technique known as paging. The blur_dsplib_paging program provides the same functionality as blur_dsplib, but augments it by paging blocks of the image in to an on-chip input scratch buffer (input_buf) prior to passing image pixels through the FIR filter implemented via DSP_fir_gen. Additionally, another on-chip scratch buffer, output_buf, contains the filtered pixels for the current block, and when this entire block has been filtered (and will not be referenced again), the contents of output_buf are paged out to out_img, which also resides in external RAM. Listing 4-3 gives the portions of blur_dsplib_paging.c that pertain to this memory optimization.
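
The access-penalty arithmetic above generalizes to any kernel width. The following is a small worked sketch of that cost model only; the function names are hypothetical, and it assumes every external access costs the same worst-case latency and ignores any overlap between fetches and computation.

```c
#include <assert.h>

/* Cycles to read one interior pixel nh * nh times straight from external RAM. */
static long cycles_external_only(int nh, int ext_latency)
{
    assert(nh > 0 && ext_latency > 0);
    return (long)nh * nh * ext_latency;
}

/* One external fetch to page the pixel in, then nh * nh on-chip accesses. */
static long cycles_paged(int nh, int ext_latency, int onchip_latency)
{
    assert(nh > 0 && ext_latency > 0 && onchip_latency > 0);
    return ext_latency + (long)nh * nh * onchip_latency;
}
```

For the 5×5 kernel in the text, cycles_external_only(5, 17) gives 425 and cycles_paged(5, 17, 1) gives 42, matching the figures quoted above. The gap widens with the square of the kernel width, which is why larger kernels benefit more from keeping the working set on-chip.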

Listing 4-3: portions of blur_dsplib_paging.c


 

This program uses two functions for shuttling blocks of pixels around: memcpy and DSP_blk_move. DSP_blk_move is optimized for word-aligned 16-bit short integers [7], and thus is used for moving data between the buffers containing Q15 data: in_img, input_buf, and fir_buf. The standard C library function memcpy is used for paging the 8-bit processed pixels out from internal RAM to out_img in external RAM. Note that the transition between blocks provides an avenue for further optimization: the last BOUNDARY rows of the input image scratch buffer (input_buf) need not be paged in if the last BOUNDARY rows of the scratch FIR output buffer (fir_buf) are moved from the bottom of the buffer to the top in a circular fashion.
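
The block-by-block flow described above can be sketched on a host machine as follows. This is an illustrative sketch, not the code from blur_dsplib_paging.c: the static arrays merely stand in for on-chip scratch buffers, memcpy stands in for both copy routines, fir_q15_ref is a hypothetical portable substitute for the DSPLIB FIR, and COLS, BLOCK_ROWS, EXTRA_ROWS, and blur_paged are names and sizes chosen here for illustration. It re-pages the overlapping rows for every block, so the circular-buffer refinement is not shown.

```c
#include <assert.h>
#include <string.h>

#define COLS        8            /* illustrative image width */
#define NH          3            /* kernel is NH x NH */
#define BLOCK_ROWS  4            /* output rows produced per page */
#define EXTRA_ROWS  (NH - 1)     /* extra input rows each block needs */

/* Stand-ins for the on-chip scratch buffers. */
static short input_buf[(BLOCK_ROWS + EXTRA_ROWS) * COLS + NH - 1];
static short fir_buf[(BLOCK_ROWS + EXTRA_ROWS) * COLS];
static unsigned char output_buf[BLOCK_ROWS * COLS];

/* Hypothetical portable Q15 FIR: r[i] = (sum of x[i + j] * h[j]) >> 15. */
static void fir_q15_ref(const short *x, const short *h, short *r, int nh, int nr)
{
    for (int i = 0; i < nr; i++) {
        long acc = 0;
        for (int j = 0; j < nh; j++)
            acc += (long)x[i + j] * h[j];
        r[i] = (short)(acc >> 15);
    }
}

/* in_img: rows * COLS Q15 samples; out_img: rows * COLS bytes.
   Writes the first rows - NH + 1 rows of out_img; in each of those rows,
   columns past COLS - NH are set to zero. */
static void blur_paged(const short *in_img, unsigned char *out_img,
                       int rows, const short *h)
{
    int valid_rows = rows - EXTRA_ROWS;
    assert(rows >= NH);

    for (int r0 = 0; r0 < valid_rows; r0 += BLOCK_ROWS) {
        int n_out = valid_rows - r0 < BLOCK_ROWS ? valid_rows - r0 : BLOCK_ROWS;
        int n_in = n_out + EXTRA_ROWS;

        /* Page in: copy the block plus its extra rows, zero the FIR tail. */
        memcpy(input_buf, in_img + (size_t)r0 * COLS,
               (size_t)n_in * COLS * sizeof(short));
        memset(input_buf + (size_t)n_in * COLS, 0, (NH - 1) * sizeof(short));

        /* Filter every paged-in row with one FIR call. */
        fir_q15_ref(input_buf, h, fir_buf, NH, n_in * COLS);

        /* Sum NH vertically adjacent row results into the output scratch. */
        memset(output_buf, 0, sizeof output_buf);
        for (int r = 0; r < n_out; r++)
            for (int c = 0; c + NH <= COLS; c++) {
                int sum = 0;
                for (int k = 0; k < NH; k++)
                    sum += fir_buf[(r + k) * COLS + c];
                output_buf[r * COLS + c] = (unsigned char)sum;
            }

        /* Page out: the block is finished and will not be read again. */
        memcpy(out_img + (size_t)r0 * COLS, output_buf, (size_t)n_out * COLS);
    }
}
```

Each block pages in EXTRA_ROWS more rows than it produces, because the bottom output rows of a block need input rows that belong to the next block. Those overlapping rows are exactly the ones a circular reuse of the FIR output would avoid fetching and filtering twice.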

With the same 256×256 image, a 5×5 smoothing kernel, and a 16-row block, blur_dsplib_paging takes on average 2,195,546 cycles to perform the low-pass filtering operation. While this constitutes a savings of 3,971 cycles versus the implementation that does not use paging, it represents a very small savings of only 0.18%. A far more dramatic time savings can be achieved by using the C62xx/C67xx DMA controller.
