Linear Filtering of Images on The TI C64X (Image Processing) Part 1

As described in 2.1, the C6416 fixed-point DSP is a newer member of the C6000 DSP family that offers higher performance than the C62xx series (it is not truly fair to compare the C64x to the C67x, as the C67x is a floating-point architecture). With regard to the CCStudio projects and C source code accompanying this topic, all of the C62xx/C67xx projects were built and tested using the EVM development environment. The C64xx projects in this topic, on the other hand, were built and tested using the C6416 DSK.

This section describes how to optimize a low-pass filtering program targeting the C6416. As one would expect for a DSP marketed to the imaging community, there is a version of IMGLIB optimized specifically for the C64xx line of DSPs [11]. Starting with a core IMGLIB convolution routine (that works!), an initial fixed-point implementation of a 3×3 low-pass filtering operation will be optimized in much the same fashion as in sections 4.3.3-4.3.5, by paging in blocks of the image as they are needed.

The source code for these programs is found in the Chap4\LinearFilter\C64xx directory. As in the C62xx/C67xx case, there are some common project files: the image.h header file, which contains default pixel data for the Lenna image, and the linker command file image_filter.cmd. The image.h header file remains much the same as its C62xx/C67xx counterpart, except that the in_img buffer is defined to be exactly N_PIXELS long; it does not need to be padded with extra samples because DSP_fir_gen is no longer used to filter the image. The linker command file is similar to that of the C62xx/C67xx projects, but the memory map for the C6416 DSK differs from that of the C62xx/C67xx EVM, so the MEMORY section is tailored for the DSK. Listing 4-5 shows the contents of the C64x DSK image_filter.cmd. Note that normally the ".cinit" section would map to the IRAM section, but due to the large amount of initialization data it is mapped to an external RAM segment.

Listing 4-5: Example C6416 DSK linker file, image_filter.cmd.

The supporting infrastructure also changes slightly to accommodate the switch to a different processor and to the DSK environment. In particular, these programs link to C6416 versions of the runtime support library (rts6400.lib), the chip support library (csl6416.lib), and a library we have yet to encounter, the board support library (dsk6416.lib). And of course, the programs link to a different IMGLIB static library (img64x.lib). Finally, the CHIP_xxxx preprocessor symbol is set to CHIP_6416.

Low-Pass Filtering with a 3×3 Kernel Using IMGLIB (blur3x3_imglib)

Consider the following 3×3 smoothing kernel:

$$H = \frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$$

Similar to a Gaussian kernel, this kernel gives the center pixel a larger contribution than its surrounding neighbors, and the weights sum to 1, a requirement for an averaging kernel that maintains the gain of the input image. A naive implementation would simply apply the convolution equation directly, but a far more efficient fixed-point implementation factors out the division by 16, so that for a neighborhood centered about pixel f(i,j) the output pixel g(i,j) is:

$$g(i,j) = \frac{1}{16}\Big[ f(i-1,j-1) + 2f(i-1,j) + f(i-1,j+1) + 2f(i,j-1) + 4f(i,j) + 2f(i,j+1) + f(i+1,j-1) + 2f(i+1,j) + f(i+1,j+1) \Big]$$

A further optimization is to replace the division by 16 with the equivalent operation of bit shifting to the right by 4 bits. This fixed-point convolution algorithm can be simulated in MATLAB with the following code:

H = [1 2 1; 2 4 2; 1 2 1];

J = imfilter(uint16(I), H); % I is a uint8 image matrix

J = uint8(bitshift(J, -4)); % divide by 16

In the second line, I is promoted from 8 to 16 bits because the imfilter command returns a matrix of the same type as the input matrix; without this type promotion, J would consist mostly of saturated (255) values if I is originally of type uint8. If conv2 is used instead of imfilter, the uint16 conversion is not required, as conv2 promotes everything to the double type. The third statement is a vectorized bit shift operation: every element of the matrix is divided by 2^4.
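The same arithmetic can be sketched in C. This is a minimal illustration, not the code from the accompanying listings: the function name blur3x3_fixed and the decision to copy border pixels unchanged are assumptions made here (the MATLAB imfilter call above zero-pads the border by default instead).

```c
#include <stdint.h>

/* Hypothetical helper, not part of the original project files.
 * Applies the 1-2-1 / 2-4-2 / 1-2-1 kernel to the interior pixels of an
 * 8-bit, row-major image and replaces the division by 16 with a right
 * shift by 4. Border pixels are copied through unchanged (an assumption). */
void blur3x3_fixed(const uint8_t *in, uint8_t *out, int rows, int cols)
{
    static const int k[3][3] = { {1, 2, 1}, {2, 4, 2}, {1, 2, 1} };
    int r, c, i, j;

    for (r = 0; r < rows; ++r) {
        for (c = 0; c < cols; ++c) {
            unsigned int sum = 0;   /* worst case is 16 * 255 = 4080 */

            if (r == 0 || c == 0 || r == rows - 1 || c == cols - 1) {
                out[r * cols + c] = in[r * cols + c];
                continue;
            }
            for (i = -1; i <= 1; ++i)
                for (j = -1; j <= 1; ++j)
                    sum += (unsigned int)(k[i + 1][j + 1] *
                                          in[(r + i) * cols + (c + j)]);

            out[r * cols + c] = (uint8_t)(sum >> 4);   /* divide by 16 */
        }
    }
}
```

Because the weighted sum never exceeds 4080, the shifted result always fits back into 8 bits without any explicit saturation step.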

IMGLIB includes a function IMG_conv_3x3 that can be used to pass an image through a 3×3 kernel consisting of signed 8-bit coefficients [12]. Internally, the function uses three 16-bit accumulators that sum intermediate values during the convolution operation, and so the caller must provide a shift value, which for this particular kernel is 4. Listing 4-6 is the contents of blur3x3_imglib.c, which is the C source file for the C6416 program located in Chap4\LinearFilter\C64xx\blur3x3_imglib.
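A quick check of why a shift of 4 suits this kernel, using only the kernel weights and the 8-bit pixel range (the bound below is simple arithmetic, not a statement about the internals of IMG_conv_3x3):

$$\sum_{m,n} h(m,n) = 1+2+1+2+4+2+1+2+1 = 16 = 2^4$$

$$\max \sum_{m,n} h(m,n)\,f = 16 \times 255 = 4080 < 2^{15}-1 = 32767, \qquad 4080 \gg 4 = 255$$

The largest possible weighted sum therefore fits in a 16-bit accumulator, and shifting right by 4 brings it back into the 8-bit pixel range.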

Listing 4-6: blur3x3_imglib.c

One difference between this program and the previous C62xx/C67xx EVM programs is the timing mechanism. Unless otherwise stated, the C6416 DSK programs in this topic use the CSL timer API as described in [12], whereas the EVM programs use the clock function for profiling code. Another difference is the addition of a new local function memclear, which is used as a faster substitute for memset.
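For reference, the clock-based style of profiling mentioned for the EVM programs can be sketched in standard C as follows. The helper names profile_ticks and fill_ramp are hypothetical, and what one tick represents depends on the platform and runtime configuration; the CSL timer calls used by the DSK programs are not shown here.

```c
#include <time.h>

/* Hypothetical signature for the code being profiled. */
typedef void (*work_fn)(void *arg);

/* Returns the number of clock() ticks consumed by one call to fn(arg). */
clock_t profile_ticks(work_fn fn, void *arg)
{
    clock_t start = clock();
    fn(arg);
    return clock() - start;
}

/* Example workload (an assumption for illustration): fills a 256-byte
 * buffer with the values 0..255. */
void fill_ramp(void *arg)
{
    unsigned char *buf = (unsigned char *)arg;
    int i;
    for (i = 0; i < 256; ++i)
        buf[i] = (unsigned char)i;
}
```

A call such as profile_ticks(fill_ramp, buf) would wrap the routine of interest, in the same way the programs here wrap the image filtering call.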

The implementation of memclear lends insight into how low-level code optimizations can provide major performance enhancements. Staying with the idea that more generality sometimes means a decrease in speed, the memclear function leverages known characteristics of the surrounding code to gain a performance edge over the standard C library function memset, which has to be general-purpose in order to maintain its contract with the programmer and which in C might be defined as a simple byte-by-byte fill loop.
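A general-purpose, byte-at-a-time fill of that kind might look like the sketch below. It is named my_memset here (a hypothetical name) to avoid clashing with the standard library, and it is only an illustration of the conservative approach, not the runtime library's actual implementation.

```c
#include <stddef.h>

/* Generic fill in the spirit of memset: no assumptions about alignment,
 * length, or fill value, so every store is a single byte. */
void *my_memset(void *dst, int value, size_t n)
{
    unsigned char *p = (unsigned char *)dst;

    while (n--)
        *p++ = (unsigned char)value;

    return dst;
}
```

Because nothing is known about the pointer or the count, a compiler has little choice but to treat this as a stream of byte stores.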

By stipulating that the number of iterations through the memclear loop (commonly referred to as the trip count) is a multiple of 8, casting the input pointer to a long (64-bit) type, and guaranteeing alignment of lptr to a 64-bit boundary via _nassert, the programmer gives the compiler several pieces of information that let it generate a loop that runs faster than a memset-like function. In the definition of memclear, what appears to be a function call to _nassert is actually an example of a TI-specific compiler intrinsic [13]. Intrinsics are extensions to ANSI C that either map to inline C6x assembly instructions that cannot be expressed in a pure ANSI-compliant C translation unit or, as in the case of _nassert, provide extra information to the compiler.

Here, the statement _nassert((int)lptr % 8 == 0) asserts that the address held in lptr is double-word aligned. Consequently, the compiler is free to use the LDDW/STDW (load/store aligned double-word) instructions to initialize the 64-bit number pointed to by lptr. LDDW/STDW are special instructions that operate on a data stream lying on an aligned memory address [14], and these aligned instructions are more efficient than their unaligned counterparts (the C64x DSP has non-aligned double-word instructions, LDNDW/STNDW, or load/store non-aligned double word). Wherever possible, unaligned loads and stores should be avoided, as the DSP can perform only a single unaligned load per clock cycle, whereas multiple aligned loads can occur in a single cycle. The more conservative memset-like code compiles to assembly language using LDB/STB (load/store byte) instructions that initialize one 8-bit value at a time, and a series of these instructions is not as efficient as a series of LDDW/STDW instructions because less data moves through the DSP per instruction.
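Putting these hints together, a function in the spirit of memclear could be sketched as follows. This is not the code from Listing 4-6: the name memclear_sketch, the use of uint64_t for the 64-bit store type, and the exact MUST_ITERATE arguments (minimum of 32 trips, trip count a multiple of 8) are assumptions based on the description above. The preprocessor guards are also an addition, so that the sketch builds on a host compiler where the TI intrinsic and pragma do not exist.

```c
#include <stddef.h>
#include <stdint.h>

/* _nassert is built into the TI C6000 compiler. On any other compiler,
 * define it away so the sketch still builds (host-side assumption). */
#ifndef _TMS320C6X
#define _nassert(cond) ((void)0)
#endif

/* Clears nbytes bytes starting at ptr.
 * Caller's contract (assumed): ptr is 8-byte aligned, and nbytes / 8 is
 * at least 32 and a multiple of 8. */
void memclear_sketch(void *ptr, size_t nbytes)
{
    uint64_t *lptr = (uint64_t *)ptr;
    size_t n = nbytes / 8;      /* number of 64-bit stores */
    size_t i;

    /* Promise the compiler that lptr is double-word aligned. */
    _nassert(((uintptr_t)lptr % 8) == 0);

#ifdef _TMS320C6X
#pragma MUST_ITERATE(32, , 8)   /* min trip count 32, multiple of 8 */
#endif
    for (i = 0; i < n; ++i)
        lptr[i] = 0;
}
```

The contract is the price of the speedup: calling this function with a misaligned pointer or a trip count that violates the stated bounds breaks the promises made to the compiler.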

The MUST_ITERATE pragma directive in memclear is a means of providing the compiler information about a loop, and is analogous to the .trip directive in linear assembly code [14]. Through this directive the programmer can specify the exact number of times a loop will execute, whether the trip count is a multiple of some number, the minimum number of iterations through the loop, and so on. This pragma should be used wherever possible, especially when the minimum trip count is known, as this information allows the compiler to be more aggressive when applying loop transformations. The form of the MUST_ITERATE pragma used in memclear specifies that the loop is guaranteed to execute at least 32 times, and armed with this information the compiler can proceed to unroll the loop.

Loop unrolling is a technique where the loop kernel (not to be confused with a filter kernel) is expanded by a factor X, and the loop stopping condition adjusted to N/X, with the intent of reducing the number of branches. Reducing the branch overhead increases the efficiency of the loop and also creates an opportunity for better scheduling of the instructions within the loop kernel. However, unrolling a loop is not always advantageous. If a loop is unrolled too much, the code size may increase to the point that it overflows the instruction cache, which essentially defeats the purpose of the loop transformation. This particular loop, given its small size (a single assignment statement), is not at risk of this problem. A further caveat is that the loop may not execute enough times to warrant unrolling. By stipulating that the minimum number of times through this loop is 32, the programmer tells the compiler that it should proceed with unrolling the loop.
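To make the transformation concrete, the sketch below writes out by hand what unrolling by a factor of X = 4 does to a simple clearing loop. The function names are hypothetical, and the compiler is free to choose a different factor and schedule; this only illustrates the idea.

```c
#include <stddef.h>
#include <stdint.h>

/* Original ("rolled") loop: N iterations, one branch per store. */
void clear_rolled(uint64_t *p, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        p[i] = 0;
}

/* Unrolled by X = 4: the stopping condition becomes N/4, so there is
 * one branch per four stores. Assumes n is a multiple of 4. */
void clear_unrolled4(uint64_t *p, size_t n)
{
    size_t i;
    for (i = 0; i < n / 4; ++i) {
        p[4 * i + 0] = 0;
        p[4 * i + 1] = 0;
        p[4 * i + 2] = 0;
        p[4 * i + 3] = 0;
    }
}
```

The assumption that n is a multiple of 4 is exactly the kind of fact that a trip-count hint communicates; without it, the unrolled version would need extra cleanup code for the leftover iterations.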

Getting back to the 2D filtering algorithm, the implementation in Listing 4-6 is more concise and frankly far simpler than earlier incarnations, for a couple of reasons. Because IMG_conv_3x3 is explicitly designed to perform 2D filtering with a fixed kernel size, there are no intermediate 1D results that need to be combined to form 2D filtered pixels. In addition, this particular program does not consider the memory bottlenecks, and simply accesses pixel data stored in external RAM. Figure 4-12 is a CCS screenshot of this program, halted just after the call to filter_image. Both the original image (in_img) and the processed image (out_img) are "graphed" side-by-side. With the -o3 compiler optimization level and no debug symbols, this program takes on average 19,141,824 cycles.
