
Linear Filtering of Images on The TI C62XX/C67XX (Image Processing) Part 3

Low-Pass Filtering with DSPLIB and Paging via DMA (blur_dsplib_paging_dma)

The DMA controller on the C62xx/C67xx transfers data between internal memory and external memory or peripherals without intervention by the processor [8,31]. It can perform burst transfers, in which only the initial access incurs the 15-17 cycle penalty and each remaining word costs only 1-2 cycles. DMA must therefore be employed for the quickest access to external memory. The project files for this program are in the blur_dsplib_paging_dma subdirectory. The overall algorithm is identical to that of blur_dsplib_paging, except that DMA support infrastructure is incorporated and all memcpy calls and every DSP_blk_move call except one are replaced by calls to a local function, dma_copy_block. The relevant contents of blur_dsplib_paging_dma.c are shown in Listing 4-4.

Listing 4-4: Portions of blur_dsplib_paging_dma.c

This program is our first encounter with the C6x Chip Support Library (CSL) [9], used here instead of the EVM library. To use the CSL, the project file should link to the appropriate library (in the case of the C6701 EVM, csl6701.lib) and define a preprocessor symbol indicating the DSP architecture. This symbol is of the form CHIP_xxxx, where xxxx is replaced by the model of the DSP. So for the C6701, the macro CHIP_6701 is defined prior to inclusion of any of the CSL header files.

The dma_copy_block function in Listing 4-4 provides the speedup that this program achieves over its predecessors. Control of the DMA process is achieved by setting bit fields in the DMA registers, which is what occurs in the call to the DMA_configArgs function (for the exact details, refer to the CCStudio on-line help or [9]). The DMA transfer is set up so that the caller of dma_copy_block must provide the number of words to be transferred. Since this algorithm transfers both 16-bit short integers and 8-bit unsigned characters, and a word in the C6x architecture is 32 bits, a simple conversion must take place. There are two macros, ELEM_COUNT_UCH and ELEM_COUNT_SHORT, that are used for this purpose inside filter_image.
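The element-to-word conversion can be sketched roughly as follows. The definitions of ELEM_COUNT_UCH and ELEM_COUNT_SHORT are not reproduced here, so WORDS_FOR_UCH, WORDS_FOR_SHORT, and words_for_pixel_block are hypothetical stand-ins, and rounding up to a whole word is an assumption rather than something the original macros are known to do.

```c
#include <assert.h>

/* Hypothetical stand-ins for ELEM_COUNT_UCH and ELEM_COUNT_SHORT; the
   original definitions are not shown in this section. A C6x word is
   32 bits, so one word holds four 8-bit pixels or two 16-bit shorts.
   Rounding up is an assumption, made so that a partial trailing word
   is still counted. */
#define BYTES_PER_WORD 4u
#define WORDS_FOR_UCH(n)   (((n) + BYTES_PER_WORD - 1u) / BYTES_PER_WORD)
#define WORDS_FOR_SHORT(n) (((n) * 2u + BYTES_PER_WORD - 1u) / BYTES_PER_WORD)

/* Number of 32-bit words needed to move `rows` rows of `cols` 8-bit pixels. */
static unsigned words_for_pixel_block(unsigned rows, unsigned cols)
{
    assert(rows > 0u && cols > 0u);
    return WORDS_FOR_UCH(rows * cols);
}
```

Under these assumptions, a 16-row block of a 256-pixel-wide 8-bit image occupies 1024 words, and one row of 256 short integers occupies 128 words.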

The DMA transfer is said to be asynchronous: because the transfer is transparent to the CPU, the processor is free to perform other duties while it takes place. Because this program is coded in a serial fashion, however, it must wait for each transfer to finish. At the end of dma_copy_block the DSP is put in a busy spin loop until the global variable transfer_done is set to 1. The value of transfer_done is flipped to 1 in the interrupt service routine (ISR) c_int09, which in turn is hooked to interrupt_9 by the assembly function vectors defined in vecs.asm. IRQ_EVT_DMAINT1 is enabled in set_interrupts_dma, which is called just prior to initiating the DMA transfer in dma_copy_block. This interrupt is mapped to interrupt 9 on the chip and indicates when the current DMA transfer has completed. Thus dma_copy_block is turned into a synchronous, or blocking, function, since it sits in a tight loop until the transfer completes. Finally, note that the declaration of transfer_done is decorated with the volatile keyword. This keyword is common in embedded applications, and it is important to understand its usage.

Nowhere in the source is c_int09 ever explicitly called, and the value of transfer_done is never modified anywhere else in the code. Of course, the programmer knows that the hardware will eventually generate an interrupt, resulting in c_int09 being called, but the compiler has no way of knowing this. Whenever a variable may change its value in ways that the compiler cannot detect, the volatile keyword should be used; otherwise an aggressively optimizing compiler may read the variable once, conclude that it never changes, and reduce the busy spin loop to an infinite loop or optimize it out of existence, either of which would indeed be an unfortunate occurrence.
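The flag-and-spin pattern can be imitated on a host machine as a rough sketch. Everything here is an assumption made for illustration: a standard C signal handler (c_int09_sim) stands in for the ISR, memcpy stands in for the DMA hardware, blocking_copy_sketch is a hypothetical name, and the flag is declared volatile sig_atomic_t because that is the type standard C permits a signal handler to write. None of this is the CSL code from Listing 4-4.

```c
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written only by the handler, polled by the main code. volatile forces
   the compiler to re-read the flag on every loop iteration. */
static volatile sig_atomic_t transfer_done = 0;

/* Stand-in for the ISR c_int09. Nothing below calls it directly; it runs
   only when the signal (our substitute for the DMA interrupt) arrives. */
static void c_int09_sim(int signo)
{
    (void)signo;
    transfer_done = 1;
}

/* Hypothetical host-side imitation of a blocking copy. */
static void blocking_copy_sketch(void *dst, const void *src, size_t nbytes)
{
    void (*prev)(int);

    assert(dst != NULL && src != NULL);
    transfer_done = 0;
    prev = signal(SIGINT, c_int09_sim);   /* "enable the interrupt" */
    assert(prev != SIG_ERR);

    memcpy(dst, src, nbytes);             /* stand-in for the DMA transfer */
    raise(SIGINT);                        /* stand-in for "transfer complete" */

    /* raise() delivers the signal synchronously, so on a host the flag is
       already set here; on the DSP the interrupt would arrive later and
       this loop is where the CPU would wait. */
    while (!transfer_done) {
        /* busy spin */
    }
    signal(SIGINT, prev);                 /* restore the previous handler */
}
```

Without the volatile qualifier, a compiler would be entitled to load the flag once before the loop and never look at it again.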

This program takes an average of 1,552,183 cycles to low-pass filter a 256×256 image with a 5×5 kernel, using a block size of 16 rows. This is roughly 70% of the cycle count of blur_dsplib_paging, and a speedup of about 1.42x over the image smoothing program that did not implement any memory optimizations (blur_dsplib). Even more performance could be eked out of this program by more sophisticated methods. The serialization in dma_copy_block, where the program sits in a busy spin loop waiting for the DMA transfer to complete, simply cries out for further optimization, as there is no reason why the DSP could not be performing useful work during the DMA transfer. This optimization leads to a "ping-pong" implementation, an example of which is given in [10], whereby two buffers, referred to as the ping and the pong buffer, are used to interleave data transfer and processing. After the DSP initiates a DMA transfer into the ping buffer, it moves on to processing the data contained in the pong buffer. When that processing has completed, the DSP initiates another DMA transfer into the pong buffer and moves on to processing the data in the ping buffer, which contains the next set of data.
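The control flow of such a ping-pong scheme can be sketched on a host as follows. This is not the implementation from [10]: start_transfer, wait_transfer, process_block, ping_pong_sum, and BLOCK_LEN are hypothetical names, memcpy stands in for an asynchronous DMA transfer, and the "processing" is just a sum so that the result is easy to check.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_LEN 4   /* elements per block; illustrative value only */

/* Hypothetical stand-in for starting a DMA transfer. On the DSP this would
   return immediately while the hardware fills `dst`; here memcpy finishes
   the copy before returning. */
static void start_transfer(int *dst, const int *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Hypothetical stand-in for waiting on the DMA completion interrupt. */
static void wait_transfer(void)
{
}

/* Example processing step: sum the elements of one block. */
static long process_block(const int *buf, size_t n)
{
    long sum = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        sum += buf[i];
    return sum;
}

/* Sum `nblocks` blocks of "external" data using two alternating buffers. */
static long ping_pong_sum(const int *ext, size_t nblocks)
{
    int buf[2][BLOCK_LEN];   /* buf[0] is ping, buf[1] is pong */
    long total = 0;
    size_t b;

    assert(ext != NULL);
    if (nblocks == 0)
        return 0;

    start_transfer(buf[0], ext, BLOCK_LEN);      /* prime the ping buffer */
    for (b = 0; b < nblocks; ++b) {
        int cur = (int)(b & 1u);
        wait_transfer();                         /* block b has arrived */
        if (b + 1 < nblocks)                     /* fetch block b+1 into the other buffer */
            start_transfer(buf[cur ^ 1], ext + (b + 1) * BLOCK_LEN, BLOCK_LEN);
        total += process_block(buf[cur], BLOCK_LEN);   /* would overlap the fetch */
    }
    return total;
}
```

With a real DMA engine, the call to process_block would run while the next block is still in flight, which is where the time saved over the blocking version would come from.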

Full 2D Filtering with DSPLIB and DMA (filter_dsplib_paging_dma)

This implementation is a variation on the previous three programs, and is an example of a generic image filtering algorithm, capable of handling kernels other than low-pass filters with constant filter coefficients. This program relies on DSP_fir_gen like the others and also incorporates the DMA paging optimization we just introduced. Thus the performance of this program provides a fair comparison with filter_imglib, as they both offer the same functionality. The source code is not listed here, as it is quite similar to the code in Listing 4-4; in fact the DMA code, ISR, and main are identical. The full project files may be found on the CD in the filter_dsplib_paging_dma subdirectory.

Essentially, this program operates in a fashion reminiscent of Figure 4-9, where each row in the image is passed through a potentially different 1D FIR filter multiple times, as the filter mask marches down the image. However, this process is in turn segmented in the block-wise pattern depicted in Figure 4-11, to enable the DMA paging optimization. The cycle count for filtering a 256×256 image using a 5×5 kernel and 16-row blocks is 3,669,928 cycles. The memory-optimized blurring program from the preceding section (blur_dsplib_paging_dma) filters the same-sized image in 42% of the time it takes this program, so if an application calls for an averaging kernel, that program should be used instead. However, most 2D filters do not consist of kernels where each row is the same, so this program can be used for the more general case. Finally, even in this program we have implemented only first-order optimizations, and there are many additional low-level optimizations that could be performed on the code as well. For example, fixing the kernel size, and thereby getting rid of many of the loops, mitigates some looping overhead, although at the cost of some flexibility: this function has been written so that it can be used with both a 3×3 and a 5×5 kernel, for example. As a rule of thumb, the more general an algorithm implementation, the less opportunity there is for fine-tuning towards further optimization. Table 4-1 summarizes the performance results for the various C62xx/C67xx 2D filtering programs.
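The row-by-row decomposition described here can be sketched in plain C. This is only an illustration under stated assumptions: fir_row_accumulate is a hypothetical stand-in and not DSPLIB's DSP_fir_gen (whose argument order and coefficient ordering differ), filter_2d_by_rows is a hypothetical name, only the valid region of the output is computed, and the block-wise DMA paging is left out.

```c
#include <assert.h>

/* Plain C stand-in for one 1D FIR pass; NOT DSPLIB's DSP_fir_gen.
   Adds the response of the ntaps-tap filter h over input row x into acc.
   Produces nout outputs; x must hold nout + ntaps - 1 samples. */
static void fir_row_accumulate(const short *x, const short *h, int ntaps,
                               int *acc, int nout)
{
    int i, k;
    for (i = 0; i < nout; ++i)
        for (k = 0; k < ntaps; ++k)
            acc[i] += x[i + k] * h[k];
}

/* Filter a rows x cols image with a ksize x ksize kernel (correlation
   form, valid region only). out is (rows-ksize+1) x (cols-ksize+1). */
static void filter_2d_by_rows(const short *img, int rows, int cols,
                              const short *kernel, int ksize, int *out)
{
    int out_rows = rows - ksize + 1;
    int out_cols = cols - ksize + 1;
    int r, kr, c;

    assert(ksize > 0 && rows >= ksize && cols >= ksize);
    for (r = 0; r < out_rows; ++r) {
        int *acc = out + r * out_cols;
        for (c = 0; c < out_cols; ++c)
            acc[c] = 0;
        /* Kernel row kr is applied as a 1D FIR filter to image row r + kr,
           so each image row is filtered up to ksize times, once per
           kernel row, as the mask marches down the image. */
        for (kr = 0; kr < ksize; ++kr)
            fir_row_accumulate(img + (r + kr) * cols, kernel + kr * ksize,
                               ksize, acc, out_cols);
    }
}
```

When every kernel row is identical, the per-row FIR results can be computed once and reused across output rows, which is the shortcut that is only available to the blurring programs.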

Table 4-1. Performance results for the various C62xx/C67xx image filtering programs, profiled using the clock function as described in [14]. Cycle counts are the average of ten runs, with the -o3 compiler optimization level and no debug symbols.

Program	Comments	Number of Cycles
	Uses C function instead of IMG_corr_gen.	46,972,076
	Each row in kernel identical.	2,199,517
	Each row in kernel identical.	2,195,546
blur_dsplib_paging_dma	Each row in kernel identical.	1,552,183
filter_dsplib_paging_dma		3,669,928