Convolution and the DWT (Code Optimization) (Image Processing)

Sometimes a reordering of straight C code is enough to goad the TI compiler into generating more vectorized code. The following examples illustrate the limitations that prevent the compiler from delving deep into the recesses of the programmer's mind and autonomously discerning how best to optimize the code. At times the compiler needs a kick in the pants, and it is incumbent upon the programmer to do their due diligence. Sometimes this means taking a look at the generated assembly code.

In 6.3.2.1, a 2D D4 DWT and IDWT implementation on the C6416 was discussed. Performing the D4 DWT involves convolving a 4-tap filter with an input signal, which in the case of a 2D DWT is either a row or a column of a matrix. The convolution essentially calls for flipping, or time-reversing, the filter coefficients and then "sliding" them across the input signal. At each convolution position, or "lag" as it is commonly called, the sum of products between the flipped filter array and a segment of the input signal is computed (see Figure 6-20). Listing B-6 reproduces a portion of the wave_horz function from daub4.c (see Listing 6-16). This is but one of many convolution loops needed to implement the 2D D4 DWT and IDWT.
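To make the flip-and-slide description concrete, here is a minimal portable C sketch of direct-form convolution. It is not the book's wave_horz code: the function names conv_at_lag and conv_valid are hypothetical, and the sketch leaves out the decimation and boundary handling that a DWT needs.

```c
#include <stddef.h>

/* Convolution at a single lag n. The filter index k runs forward while the
 * signal index n-k runs backward, which is the "flipped filter" view.
 * The caller must ensure n >= ntaps-1 so that n-k never goes negative. */
static int conv_at_lag(const short *x, const short *h, size_t ntaps, size_t n)
{
    int sum = 0;
    for (size_t k = 0; k < ntaps; ++k)
        sum += (int)h[k] * (int)x[n - k];
    return sum;
}

/* Slide the filter across x, keeping only the lags where the filter fully
 * overlaps the input. Returns the number of outputs written to y. */
static size_t conv_valid(const short *x, size_t nx,
                         const short *h, size_t ntaps, int *y)
{
    size_t count = 0;
    if (ntaps == 0 || nx < ntaps)
        return 0;
    for (size_t n = ntaps - 1; n < nx; ++n)
        y[count++] = conv_at_lag(x, h, ntaps, n);
    return count;
}
```

Note that inside conv_at_lag the two arrays are walked in opposite directions. That opposing access pattern is what the rest of this section works to remove.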

Listing B-6: Portions of wave_horz (see Listing 6-16 for the complete code listing).

Listing B-7 is a portion of the generated assembly file (daub4.asm), corresponding to the C code in Listing B-6, with the CCStudio (version 2.20) default release build options selected. To view generated assembly code, use the -k option during compilation, which tells the TI C compiler to retain all .asm files. To set this option within CCStudio, select Project|Build Options, click on the "Compiler" tab, select the "Assembly" category, and finally select the "Keep Generated .asm Files" checkbox.

Listing B-7: Assembly code for the convolution loop in Listing B-6, using standard release build options of CCStudio (version 2.20).


Reading assembly is hard on the eyes (even more so when it is parallelized C6x assembly). The important point, however, is that even though the compiler has generated code using LDNDW, it is still possible to improve this loop using techniques from the preceding section. Listing B-8 is a rewritten loop that uses intrinsics to take advantage of packed data processing via _dotp2. For this loop to work correctly, the hLP array has to be time-reversed, so that the filter coefficients and the input samples are both read in ascending order and can be paired up by _dotp2. Rather than reverse the array at the beginning of each call to wave_horz, the code in Listing B-8 assumes that the caller passes in arrays containing flipped D4 filter coefficients.
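The packed dot product can be illustrated on any host with a small emulation. The sketch below is not Listing B-8. The helpers dotp2_emul, pack2, and conv4_packed are hypothetical stand-ins written for this illustration. dotp2_emul mimics what _dotp2 computes (the product of the two signed lower 16-bit halves plus the product of the two signed upper 16-bit halves), and it assumes the usual two's-complement narrowing conversions.

```c
#include <stdint.h>

/* Host-side stand-in for the C64x _dotp2 intrinsic: treat each 32-bit word
 * as two signed 16-bit lanes, multiply lane by lane, and add the products. */
static int32_t dotp2_emul(uint32_t a, uint32_t b)
{
    int16_t a_lo = (int16_t)(a & 0xFFFFu), a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu), b_hi = (int16_t)(b >> 16);
    return (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
}

/* Pack two 16-bit samples into one word, lower-addressed sample in the low
 * half (the layout a little-endian 32-bit load would produce). */
static uint32_t pack2(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* One 4-tap output from two packed dot products. hrev holds the
 * time-reversed coefficients, so hrev[i] lines up with x[i]. */
static int32_t conv4_packed(const int16_t *x, const int16_t *hrev)
{
    return dotp2_emul(pack2(x[0], x[1]), pack2(hrev[0], hrev[1]))
         + dotp2_emul(pack2(x[2], x[3]), pack2(hrev[2], hrev[3]));
}
```

On the DSP the packing step would not exist as separate code: the samples and coefficients would be loaded from memory as whole words, and each dotp2_emul call would correspond to a single DOTP2 instruction.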

Listing B-8: A vectorized version of the convolution loop shown in Listing B-6 that uses compiler intrinsics.


This loop does produce assembly code that uses DOTP2 and hence takes advantage of the C6416 instruction set. This is a perfectly valid solution, but note that once the filter coefficients in the hLP and hHP arrays are reordered, the compiler has enough information to generate vectorized assembly on its own. In the chap6\daub4 directory, the C file daub4_optimized.c performs the 2D D4 DWT and IDWT as described in 6.3.2.1, but rearranges the filter coefficients so that the compiler vectorizes the convolution loops in the fashion shown in Listing B-8. This can be seen in Listing B-9, which shows a portion of the wave_horz function from daub4_optimized.c along with the generated assembly code. In contrast to the assembly code in Listing B-7, notice the presence of the DOTP2 instruction. The IMGLIB wavelet functions stipulate that the wavelet filter coefficients be flipped from how they would normally be specified, à la daub4_optimized.c, presumably for the same reason.
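The effect of flipping the coefficients can be seen in plain C, without intrinsics. The sketch below is not the code from daub4_optimized.c, and the names reverse_taps, conv4_backward, and conv4_forward are hypothetical. It only shows that a one-time reversal turns the opposing index pattern into two ascending walks while producing the same result.

```c
/* Build a time-reversed copy of the filter taps (done once, by the caller). */
static void reverse_taps(const short *h, short *hrev, int ntaps)
{
    for (int k = 0; k < ntaps; ++k)
        hrev[k] = h[ntaps - 1 - k];
}

/* Original-style inner product: h is read forward, x is read backward. */
static int conv4_backward(const short *x, const short *h)
{
    int sum = 0;
    for (int k = 0; k < 4; ++k)
        sum += h[k] * x[3 - k];
    return sum;
}

/* Same result with pre-flipped taps: both arrays are read in ascending
 * order, so adjacent 16-bit elements pair up naturally. */
static int conv4_forward(const short *x, const short *hrev)
{
    int sum = 0;
    for (int k = 0; k < 4; ++k)
        sum += hrev[k] * x[k];
    return sum;
}
```

Whether a given compiler version actually emits DOTP2 for the forward loop depends on its build options and on what it can prove about alignment and aliasing, so the generated assembly is still worth checking.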

Listing B-9: Functionally the same convolution loop as shown in Listing B-6, but assumes the filter coefficients in hLP are time-reversed. This code comes from daub4_optimized.c, and the generated assembly from this loop kernel follows below.


Something similar is done within the inverse horizontal wavelet function invwave_horz, to the same effect. The synthesis filter arrays are not flipped in daub4_optimized.c. Listing B-10 shows the original "upsample and convolution" loop from invwave_horz. We can nudge the compiler into using DOTP2 through the use of four two-element arrays, which serve as a reordering mechanism. This trick is shown in Listing B-11. It turns out that with the flipping of the filter coefficient array for wave_horz, and the appropriate array indexing modifications made within wave_vert to account for this change, nothing more needs to be done to optimize the vertical DWT. Optimization of invwave_vert is left as an exercise for the reader.
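One plausible reading of the reordering idea is sketched below, with the caveat that it is an assumption and not the book's Listing B-11. Upsampling by two inserts zeros, so each output of a 4-tap synthesis filter touches only two input samples. Splitting the taps into an "even" pair and an "odd" pair, each stored in input order, reduces every output to a two-element dot product. With a low-pass and a high-pass filter this would give four two-element arrays. All names here (upconv_ref, split_taps, dot2, upconv_pair) are hypothetical.

```c
/* Reference: zero-stuff a by a factor of 2, then convolve with the 4-tap
 * filter g, evaluated at output index n (n >= 0). */
static int upconv_ref(const short *a, const short *g, int n)
{
    int sum = 0;
    for (int k = 0; k < 4; ++k) {
        int idx = n - k;                  /* index into the upsampled signal */
        if (idx >= 0 && (idx % 2) == 0)   /* odd positions hold zeros */
            sum += g[k] * a[idx / 2];
    }
    return sum;
}

/* Reorder the taps into two pairs that line up with {a[m-1], a[m]}. */
static void split_taps(const short g[4], short even[2], short odd[2])
{
    even[0] = g[2]; even[1] = g[0];       /* used for outputs y[2m]   */
    odd[0]  = g[3]; odd[1]  = g[1];       /* used for outputs y[2m+1] */
}

/* Two-element dot product over adjacent 16-bit values. */
static int dot2(const short *p, const short *q)
{
    return p[0] * q[0] + p[1] * q[1];
}

/* Outputs y[2m] and y[2m+1] for m >= 1, with no zero-stuffing at all. */
static void upconv_pair(const short *a, int m,
                        const short even[2], const short odd[2],
                        int *y_even, int *y_odd)
{
    *y_even = dot2(even, &a[m - 1]);
    *y_odd  = dot2(odd,  &a[m - 1]);
}
```

Because dot2 reads two adjacent 16-bit values from each array in ascending order, it has the shape that maps onto a single DOTP2. The first output pair (m = 0) needs separate boundary handling, which this sketch omits.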

Listing B-10: Portions of invwave_horz from daub4.c (see Listing 6-16 for the complete code listing).


Listing B-11: Demonstrating the use of a reordering mechanism so as to get the compiler to use DOTP2. This code snippet is from the version of invwave_horz in daub4_optimized.c.

