Linear Filtering of Images on The TI C64X (Image Processing) Part 2

A Memory-Optimized 2D Low-Pass Filter (blur3x3_imglib_paging_dma)

In previous sections, we proceeded through a step-by-step dissection of an optimization of 2D image filtering by reducing memory latencies. Now that we have seen this once, we can cut to the chase. Listing 4-7 shows the relevant portions of the blur3x3_imglib_paging_dma.c file, which can be found in the CCStudio project directory located in Chap4\LinearFilter\C64xx\blur3x3_imglib_paging_dma.

Figure 4-12. Low-pass filtering of the Lenna image on the C6416 DSK. The image on the left is the original image, stored in the in_img array. On the right is the low-pass filtered image, stored in the out_img array.


Listing 4-7: portions of blur3x3_imglib_paging_dma.c.

 * NOTE: pad output_buf with 2*BOUNDARY pixels because
 * IMG_conv_3x3 required # cols arg to be multiple of 8.
 * If this wasn’t done you’d write past the end of the array.

This program uses the paging memory management "design pattern," to borrow a term from software engineering, to circumvent the problems associated with accessing data located in slow off-chip RAM. As image data is needed, blocks are paged in and out via DMA channels. A few subtle changes in the C64xx implementation warrant additional discussion. For starters, the DMA-specific portion of the implementation is more concise than its C62xx/C67xx counterpart (see Listing 4-4), because here we do not use an ISR to wait for the DMA transfer to complete. Instead, this code relies on the CSL API function DAT_wait to wait for a memory transfer to complete (internally DAT_wait more than likely uses an ISR, but this implementation detail is well hidden from the programmer).

The DMA mechanism in the C6x1x line of DSPs (e.g. C6211, C6711, C6713, and C6416) differs from that of the C6x0x series (C6201, C6205, C6701) in that an enhanced DMA (EDMA) peripheral supplants the older DMA peripheral. The EDMA controller offers certain advantages over the legacy DMA controller, such as a larger number of channels (64), high concurrency, and automatic re-arming of trigger events so that a sequence of transfers can be triggered from a single event [9]. As we shall soon see, the abstraction afforded by the CSL API allows the program to take advantage of a small amount of parallelism without having to implement a complicated double-buffering or ping-pong scheme.
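The paging pattern described above can be sketched on a host machine as follows. This is only an illustration under stated assumptions, not the code from Listing 4-7: sim_dat_copy and sim_dat_wait are hypothetical stand-ins for the CSL DAT module calls (DAT_wait is named in the text; DAT_copy is its usual companion), page_and_process and MAX_BLOCK_BYTES are made-up names, and the per-pixel inversion is a placeholder for the real 3x3 filter call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side stand-ins for the CSL DAT calls. On the DSP the copy
 * is handed to the EDMA controller; here memcpy performs it synchronously. */
typedef unsigned int xfer_id_t;

static xfer_id_t sim_dat_copy(const void *src, void *dst, size_t nbytes)
{
    static xfer_id_t next_id = 0;
    memcpy(dst, src, nbytes);
    return ++next_id;               /* transfer ID later passed to the wait call */
}

static void sim_dat_wait(xfer_id_t id)
{
    (void)id;                       /* the simulated copy has already finished */
}

enum { MAX_BLOCK_BYTES = 1024 };    /* assumed size of the fast "on-chip" buffer */

/* Page the image through the on-chip buffer one block of rows at a time. */
static void page_and_process(const unsigned char *in, unsigned char *out,
                             int rows, int cols, int block_rows)
{
    static unsigned char on_chip[MAX_BLOCK_BYTES];
    int r;

    assert(rows >= 0 && cols > 0 && block_rows > 0);
    assert((size_t)block_rows * (size_t)cols <= sizeof on_chip);

    for (r = 0; r < rows; r += block_rows) {
        int n_rows = (rows - r < block_rows) ? rows - r : block_rows;
        size_t offset = (size_t)r * (size_t)cols;
        size_t nbytes = (size_t)n_rows * (size_t)cols;
        size_t i;
        xfer_id_t id;

        /* Page in: slow "off-chip" image -> fast on-chip block. */
        id = sim_dat_copy(in + offset, on_chip, nbytes);
        sim_dat_wait(id);

        /* Placeholder for the filter: invert each pixel in place. */
        for (i = 0; i < nbytes; i++)
            on_chip[i] = (unsigned char)(255 - on_chip[i]);

        /* Page out: on-chip block -> "off-chip" output image. */
        id = sim_dat_copy(on_chip, out + offset, nbytes);
        sim_dat_wait(id);
    }
}
```

A real 3x3 filter would also need each block to overlap its neighbors by a row of context, which this sketch omits for brevity.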

The general structure of the filtering algorithm closely follows that of Listing 4-4. The blocking function DAT_wait is used to block until DMA transfers are complete. There is some amount of parallelism in this implementation: note that the call to DAT_wait is made after the call to memclear in the algorithm prologue section. Similar scheduling is also used during the filtering of the image interior, the basic premise being to continue with other independent tasks while DMA transfers are taking place, since the processor is free while the EDMA controller handles the data transfer. This parallelism is a major benefit of DMA, in addition to the fact that DMA provides burst transfers for fast access to off-chip RAM and other peripherals.
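The scheduling idea in the prologue (start the transfer, do independent work, then wait) can be modeled on a host as below. This is a sketch under assumptions rather than the book's code: transfer_t, start_copy, wait_copy, prologue, and the buffer sizes are all hypothetical, and the transfer is modeled by deferring the copy until the wait, which mimics the rule that the destination must not be read before the wait returns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of an asynchronous transfer: start_copy only records the
 * request, and wait_copy carries it out. */
typedef struct {
    const void *src;
    void       *dst;
    size_t      nbytes;
    int         pending;
} transfer_t;

static void start_copy(transfer_t *t, const void *src, void *dst, size_t nbytes)
{
    assert(t != NULL && src != NULL && dst != NULL);
    t->src = src;
    t->dst = dst;
    t->nbytes = nbytes;
    t->pending = 1;
}

static void wait_copy(transfer_t *t)
{
    if (t->pending) {
        memcpy(t->dst, t->src, t->nbytes);
        t->pending = 0;
    }
}

enum { COLS = 8, BLOCK_ROWS = 4, BOUNDARY = 1 };   /* assumed sizes */

/* Prologue: kick off the first block transfer, clear the output border while
 * the transfer is "in flight", and only then wait for the block. */
static void prologue(const unsigned char *image,
                     unsigned char *in_block,
                     unsigned char *out_border)
{
    transfer_t t;

    start_copy(&t, image, in_block, (size_t)BLOCK_ROWS * COLS);
    memset(out_border, 0, (size_t)BOUNDARY * COLS);  /* independent work */
    wait_copy(&t);                                   /* in_block is valid from here on */
}
```

The work placed between the start and the wait must not touch the block being transferred; clearing a separate border buffer satisfies that condition.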

There is a version of this program (blur3x3_imglib_paging.c) in the same directory that uses memcpy to page in blocks of memory. With the standard profiling compiler options, that program takes on average 10,408,890 cycles to filter our 256×256 image with the 3×3 smoothing kernel, or roughly half the time it takes the completely memory-unoptimized version to execute. The EDMA-enabled version of this algorithm takes only 1,646,381 cycles on average to execute, for a speedup of over 11.5x versus the memory-unoptimized version! Table 4-2 summarizes the performance benchmarks for the three C6416 DSK image filtering programs.

Table 4-2. Performance results for three C6416 DSK image filtering programs, profiled using the CSL timer API as described in [12]. Cycle counts are the average of ten runs, with no debug symbols and the -o3 compiler optimization level.

Program                      Comments                       Number of Cycles
blur3x3_imglib               No memory optimization.        19,141,824
blur3x3_imglib_paging        Paging via memcpy function.    10,408,890
blur3x3_imglib_paging_dma    Paging via EDMA.               1,646,381
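The speedup figures follow directly from the cycle counts in Table 4-2. As a small worked check (the speedup helper is a hypothetical name, not part of the book's code), dividing the unoptimized count by each optimized count gives roughly 1.8x for memcpy paging and roughly 11.6x for EDMA paging, and the EDMA version is roughly 6.3x faster than the memcpy version.

```c
#include <assert.h>

/* Cycle counts copied from Table 4-2. */
#define CYCLES_UNOPTIMIZED 19141824UL
#define CYCLES_MEMCPY      10408890UL
#define CYCLES_EDMA         1646381UL

/* Speedup of an optimized run relative to a baseline run. */
static double speedup(unsigned long baseline_cycles, unsigned long optimized_cycles)
{
    assert(optimized_cycles > 0);
    return (double)baseline_cycles / (double)optimized_cycles;
}
```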
