Even though this topic has a narrow focus, it calls for a wide array of tools, some hardware (DSP development boards) but mostly software. It is the author’s strong belief that the development of embedded algorithms should proceed from a high-level vantage point down to the low-level environment, in a series of distinct, clearly defined milestones. The risk in jumping headlong into the embedded environment is getting bogged down in countless details and the inevitable unforeseen engineering challenges that may or may not be directly related to the algorithm. Embedded development is hard, and some may claim much harder than coding a web application, GUI application, Java program, or most other software intended to run on a desktop machine. Embedded developers are much closer to the hardware, and usually have fewer computing resources available at their disposal. The saving grace is that the programs typically running on an embedded DSP system are of a much smaller footprint than a desktop or server application. Nevertheless, when you are that much closer to the hardware there are many, many issues that must be taken into account and hence the development strategy put forth in this topic – start with the background, prototype whatever operation needs to be implemented, and slowly but surely work your way down to the DSP.
Although this description of embedded image processing development may appear to be characterized by fire and brimstone, in reality the overall situation has gotten much better over the years. Ease of development is proportional to the quality of the tools at your disposal, and in this topic all of the tools encountered in this topic are formally introduced. These include:
• The TMS320C6000 line of DSPs, in particular the C6416 DSP Starter Kit (DSK) and the C6701 Evaluation Module (EVM).
• Visual Studio .NET 2003, and the libraries used to build image processing applications that run on the various flavors of Microsoft Windows.
A small amount of background information on the TI line of DSPs is appropriate before jumping into the fray. Thus we first take a slight detour into the land of computer architecture and computer arithmetic, before getting down to business and describing the contents of our tool chest we use throughout the rest of this topic.
In 1997, Texas Instruments introduced the C6x generation of DSPs. These chips were unique in that they were the first DSPs to embrace the Very Long Instruction Word (VLIW) architecture1. This architectural aspect of the DSP, deemed VelociTI™, enabled a degree of parallelism previously unheard of in processors of its class and is the key to their high performance. The first members of the C6x family were the fixed-point C62x (C6201, C6211, etc.) and floating-point 67x (C6701, C6711, C6713, etc.) series. These DSPs feature eight functional units (two multipliers and six arithmetic logic units, or ALUs) and are capable of executing up to eight 32-bit instructions per cycle. The C62x/C67x was followed in 2001 by the C64x series of fixed-point DSPs, which represented a large step up in processor speeds (scalable up to 1.1 GHz versus approximately 250-300 MHz for the C62x/C67x at the time of this writing) and also introduced extensions to VelociTI that have important ramifications on high-performance image processing. Figure 1-1 shows block diagrams of both architectures, illustrating their common roots.
What Figure 1-1 does not show are all of the associated peripherals and interconnect structures that are equally important to understanding the C6000 architecture. This relationship is shown in Figure 1-2 for the case of the C62x/C67x DSP. Figure 1-2 shows a processor core surrounded by a variety of peripherals and banks of memory with data shuttling across separate program and data buses. The C6000 has a relatively simple memory architecture, with a flat, byte-addressable 4 GB address space, split into smaller sections mapped to different types of RAM (SDRAM or SBSRAM). As shown in both Figures 1-1 and 1-2, there are actually two data paths in the processor core, each containing four functional units. In the C62x/C67x, each data path has 16 32-bit registers, while in the C64x this register file is augmented with an additional 16 32-bit registers per data path. In addition, the C64x features an enhanced multiplier that doubles the 16-bit multiply rate of the C62x/C67x2. A full description of the entire C6x architecture is covered in [1-5], so in this section we instead focus on explaining the rationale and motivation behind VLIW and how it fits into the C6000 architecture, before moving to an important discussion on the difference between fixed-point and floating-point architectures. We conclude the section by introducing the two TI development environments used in this topic.
VLIW and VelociTI
In modern Complex Instruction Set Computer (CISC) processors and to a lesser extent, Reduce Instruction Set Computer (RISC) processors, there is an incredible amount of hardware complexity designed to exploit instruction level parallelism (ILP). With sincere apologies for the multitude of acronyms (at least you were warned!), the general concept is that deeply pipelined and superscalar GPPs, with their branch prediction and out-of-order execution, perform significant analysis on the instruction stream at run-time, and then transform this instruction stream wherever possible to keep the processor’s pipeline fully utilized. When the processor’s pipeline is full, it completes an instruction with every clock cycle. As one might imagine, this analysis is an extraordinarily difficult task, and while the compiler does do some things to alleviate the burden on the CPU, the CPU is very much in the loop with regards to optimizing the flow of instructions through the pipeline.
With VLIW on the other hand, the onus is completely on the compiler, and the processor relies on the compiler to provide it with a group of instructions that are guaranteed to not have any dependencies among them. Hence, the compiler, rather than the hardware, is the driving element behind taking advantage of any ILP. The VLIW concept is an outgrowth of vector processors, like the Cray supercomputers from the 1970s, which were based on this idea of the exact same operation being performed on an array of data. We can illustrate the rationale behind this concept, by considering at a very high-level the steps any processor must take to execute an instruction stream:
1. Fetch the next instruction.
2. Decode this instruction.
3. Execute this instruction.
Figure 2-1. C62x/C67x and C64x DSPs.
Figure 2-2. TMS320C62x/C67x block diagram, with peripherals and interconnect structure.
The classic example in vector processing is the addition of two vectors of length N. Instead of fetching, decoding, and executing the first instruction, fetching, decoding, and executing the second instruction, and so on up to N, why not perform steps 1 and 2 a single time and then execute this same instruction for each vector element? Hence we amortize the fetch and decode overhead over N iterations and the "Single Instruction, Multiple Data," or SIMD paradigm is born. Beginning in the 1980s, this idea was extended with VLIW architectures "as a somewhat liberated SIMD," with the various functional units of a processor "still in lockstep and under centralized control … each performing a different (through carefully preplanned) operation on different data elements."6 The operative term here is "carefully preplanned," as the compiler is now responsible for instruction scheduling. The compiler groups together multiple instructions that do not have any dependencies into a single VLIW instruction, or fetch packet, to borrow TI’s nomenclature. The processor, upon decoding this fetch packet, is now free to concurrently dispatch all of the fetch packet’s constituent instructions to available functional units for execution because there are no dependencies between any of them. Since the compiler has done the difficult work of creating the VLIW instruction stream, the processor can now be significantly simpler, as the massively complicated hardware blocks designed for run-time extraction of ILP present within CISC/RISC GPPs are no longer needed. This has two important ramifications for embedded DSPs: the clock speed increases while power consumption remains manageable.
The downside to VLIW is what happens if there are not enough independent instructions to completely fill up the fetch packet, or even worse, if the compiler is not intelligent enough perform the requisite code analysis and transformations to remove such dependencies should they appear (which they will). In this case, the compiler will be forced to pad the fetch packet with NOP (no operation) instructions. If the operations the source code lays out have a significant amount of dependencies between closely grouped instructions, the net effect is wasted processor resources. Memory is wasted with fetch packets consisting largely of NOPs, while available functional units lie idle, a veritable double whammy that completely stymies high-performance computing. Thankfully, in the world of DSP this situation is rare. For starters, compiler technology has progressed greatly since the advent of VLIW so compiler limitations are not so much of a concern nowadays, at least on mature architectures. But more significantly, the types of operations frequently encountered in DSP applications – highly repetitive numerical manipulations of large sequences of samples – are more often than not data parallel, meaning which there are comparatively fewer dependencies between successive instructions within a loop kernel. As a consequence, you have an ideal situation for the deployment of VLIW technology.
Texas Instruments broke with classical VLIW in an important aspect that led to the VelociTI technology at the heart of the C6x architecture. In conventional VLIW, the processor executes all instructions in the fetch packet with every clock cycle. Hence, if there are NOPs injected in the packet due to data dependencies, the processor’s resources are not fully utilized. With VelociTI, TI defined an execute packet as a group of instructions within the fetch packet that are in turn guaranteed to have no dependencies with respect to each other. Individual C6x instructions are 32-bits in length, and fetch packets consist of eight instructions, one per functional unit, for a total of 256 bits. Thus, even though all eight instructions are fetched at once, groups out of this eight can be dispatched simultaneously. For example, an eight-instruction fetch packet could be executed in the following manner:
• Clock 1: dispatch first three instructions
• Clock 2: dispatch next three instructions
• Clock 3: dispatch final two instructions
The key breakthrough was in allowing for the length of the execute packet to differ from that of the fetch packet. In the ideal case, we would hope that the execute packet equals the fetch packet, but at least TI has covered their tracks somewhat if this cannot be done. In this fashion, VelociTI mitigates the memory usage problem associated with traditional VLIW (in which the fetch packet must be padded with numerous NOPs for code that is not fully data parallel), while also yielding smaller code size.
TI made another step forward with their second generation VLIW-based architecture, VelociTI.2™, which was introduced alongside the C64x. This augmented technology included new specialized instructions for "packed data processing" for accelerating many of the key applications targeted by the C6x family. One of these applications is high-performance and low-cost image processing, and these architectural enhancements are utilized throughout the algorithms presented in this topic. With the proper utilization of VelociTI.2™ via sensible coding techniques and certain optimizations, the C64x both increases the computational throughput for digital media algorithms and applications, while further reducing code size.
One consequence of the C6x VLIW-based architecture is that there is no MAC instruction, per se. This will probably elicit a strong reaction, especially given the prominence accorded to this fundamental media processing operation (see 1.6). VelociTI largely eschews complex and compound instructions in favor of RISC-like simple instructions, and hence the MAC operation is handled as a multiply followed by an addition. The way the C6x makes up for this lack of an explicit MAC instruction is through the use of "software pipelining". The fixed-point chips are capable of performing two multiplies, four adds, and two address calculations per cycle, and so with a loop whose kernel consists of MAC operations (e.g. a dot product of two vectors) the processor pipelines the entire operation, that is, it begins successive operations before their predecessors have actually completed. This can only be done with loops where there are no loop-carried dependencies, and by overlapping operations across several iterations of a loop it becomes possible to achieve throughput rates of better than a MAC per cycle. In fact, this aspect of VLIW-based DSP highlights the need for different performance metrics than the standard Millions of Instructions Per Cycle (MIPS), which can be deceiving in light of the parallelism in a pipelined loop when it reaches its steady state.
Fixed-Point versus Floating-Point
The terms "fixed-point architecture" and "floating-point architecture" have been bandied about thus far without really having been fully explained. With floating-point arithmetic, the set of real numbers is represented using 32 (the single-precision float data type) or 64 (the double-precision double data type) bits by reserving fields for a fraction containing the number of significant digits (the mantissa), an exponent, and a sign bit. Here, the term "real number" refers to the mathematical notion of the ordered set of non-imaginary numbers with a decimal point. Floating-point numbers basically represent this set in scientific notation, and they are capable of representing a large range of values. The IEEE Standard 754 defines a format for the floating-point representation of real numbers7, and is the most widely used representation today.
Floating-point arithmetic is very powerful, and quite convenient for the programmer. Apart from having to worry about some nasty conditions like underflow, overflow, granularity, and some other special cases, for the most part floating-point arithmetic is transparent to the programmer. But this power and convenience comes at a cost. Performing floating-point arithmetic computations in hardware is far more complex than the equivalent integer operations. This may seem like the dark ages, but it was actually the case in the early 1980s that PCs based on the predecessor to the Pentium did not come standard with a "math coprocessor" whose job it was to accelerate floating-point math. The coprocessor was an option and without it floatingpoint arithmetic was performed in software. Nowadays of course, all Pentium-class processors come standard with the Intel math coprocessor, which in fact uses 80 bits (long double) to represent IEEE floating-point numbers. Processors which have the ability to perform floating-point arithmetic in hardware are floating-point architectures. The major advantage of fixed-point processors, where the various functional units only know how to deal with integer quantities and do not have innate support for floatingpoint numbers, is that the hardware becomes much simpler. The comparative simplicity of the hardware then makes it possible to crank up the clock speed of the processor while maintaining a handle on the overall power consumption.
Great, so in fixed-point architectures we are forever stuck only performing integer arithmetic, and there is no concept of a number with a decimal point? Clearly that is severely limiting and not very useful! Of course, floating-point math can still be implemented in software, using "bit banging" routines that perform the bit manipulations in code, but this strategy will have deleterious effects on overall system performance. We must be able to deal with real numbers and on fixed-point devices we use fixed-point arithmetic to manage an implicit decimal point. In essence, to draw an analogy with football (of the American variety), the processor has basically punted the problem of dealing with real numbers to the programmer.
With fixed-point arithmetic, this implicit decimal point, or "radix point" as it is commonly referred to, is placed somewhere in the middle of the digits in a binary encoding. The bits to the right of this "binal point" maintain a fixed number of fraction digits, and it is entirely up to the programmer to manage this binal point during subsequent arithmetic operations. What this entails is a propensity for numerous bit shifts in the code to align the operands correctly. While it makes for less aesthetically pleasing code, once you get the hang of it, the machinations of fixed-point arithmetic are really not that difficult to deal with. What can be difficult to deal with are the reduced accuracy and quantization effects that fixed-point arithmetic can have on numerical computations – these are algorithm-specific issues that must be tackled on a case-by-case basis, as we will do throughout this topic. An example fixed-point representation is shown in Figure 2-3, for the case of an 8-bit encoding.
With fixed-point arithmetic, we almost always use more than eight bits in order to gain additional dynamic range and fractional resolution. Programming some of the older non-C6x TI DSPs is actually an odd experience at first for C programmers, as these processors only deal with 32bit quantities. As a result, char, short, and int variables are all 32 bits in length.
Figure 2-3. Fixed-point representation of numbers. If the sign bit were flipped, then the final result would be negated, (a) A quantity akin to a normal (one byte) signed char, where the integer number is calculated according to the rules of a two’s complement format, (b) An example fixed-point representation using the same bits as above, with the radix point placed between the third and fourth least significant bits.
On the C6000 architecture, there are no such shenanigans and the size of the basic C integer data types are mostly what a C/C++ developer might expect, as shown in Table 2-1.
Table 2-1. TMS320C6000 C/C++ Integer Data Types. Note that in Visual C++ a long int is 32 bits in length, and a long long is 64 bits in length.
Fractional Q formats are one way of codifying fixed-point representations. One of the most common representations is the so-called "Q15" format, or more specifically Q0.15, where 16-bit short integers are treated as having their radix point between the most-significant bit (MSB) and the second MSB, leaving a fractional component of 15 bits. As described in , in general Qm.n fixed-point numbers split the m+n+1 bits in a binary encoding into two quantities, each of which are stored in two’s complement format: the m bits to the left of the radix point constitutes the integer portion of the quantity and the n bits to the right of the radix point is the fractional component of the quantity, a la Figure 2-3b.
The remaining bit is reserved as a sign bit, and thus Figure 2-3b is an example of a Q4.3 number. When interpreting a Q0.15 or more simply Q15 encoding of a floating-point quantity, this number takes on a value between -1.0 and 1.0. With fixed-point arithmetic, the proper scaling of the data is of the utmost importance. For example before conversion to the Q15 format the data must be normalized such that the maximum value is less than or equal to 1.0, and the minimum value is greater than or equal to -1.0. Conversion of a floating-point number into Q15 format is accomplished by first multiplying by 215 – which is equivalent to shifting left by 15 bits in an integer representation – and then truncating down to an integer, by lopping off whatever fractional component remains after the multiplication. In C, this translates in code to
if x is originally of type float or double, or alternatively
Now we can perform our arithmetic operations and the hardware is none the wiser. To convert back from Q15 format to a floating-point number we can use the following code:
In reality, things are not always so simple. It may be the case that Q15 numbers do not offer sufficient accuracy for the task at hand, and this implies some up front analysis on our part. But just as important, there are other cases that must be taken into account. First, consider the multiplication of two QO.N numbers:
If both Q,N and Q2N are stored in 16-bit words, then QMULT requires 32 bits of storage. Now suppose Q, and Q2 are still 16-bit words but have different Q formats, i.e.
Suppose X=7 and Y=8, then the result of the multiplication QMULT still requires 32 bits but now we have:
The rule of thumb for mixed Q multiplication is that QMULT has its radix point to the left of the (X+Y) most significant bit. So for the above example we now have a Q15 number stored in 32 bits. If QMULT is to be used as part of another computation further down the road, we will need to realign this number by bit shifting to isolate the fractional bits we wish to retain, taking extreme care to guard against overflow. This example assumes there are no integer bits (consider the case if QMULT = Qil.7 x Q22.7) and then things really start to get hairy. In some respects, the use of fixed-point arithmetic can be intimidating and has the potential to open up that veritable can of worms, but generally speaking, in image processing we usually will not encounter such situations. For the most part, the fixed-point programs in the topic follow this modus operandi:
1. Convert from a floating-point to a fixed-point representation that offers sufficient dynamic range and accuracy, by left bit-shifting.
2. Perform the numerical operations that constitute the image processing algorithm.
3. Scale back to the original input range (most often 0-255 for 8 bpp images) by shifting to the right however many bits we shifted to the left in step 1.
The majority of the embedded image processing programs in this topic are written for the fixed-point C6416 DSP. Those that are not ported to the floating-point C6701 DSP are easily done so. Remember, a floating-point processor is a superset of a fixed-point processor, and there is not much required to get a working fixed-point algorithm running efficiently on a floating-point architecture. The converse is definitely not true, as the compiler will be forced to inject slow software floating point routines wherever it encounters arithmetic between floating-point operands.
For further information on the vagaries of floating-point arithmetic, the reader is referred to [9-13].
TI DSP Development Tools (C6701 EVM and C6416 DSK)
The TMS320C6701 Evaluation Module is a PCI board with the 150 MHz floating-point TMS320C6701 DSP at its core14. Both the C6701 EVM, and its fixed-point cousin the C6201 EVM have DSPs with 64 KB of on-chip program and 64 KB of on-chip data memories. They also both have 256 KB of 133 MHz Synchronous Burst Static RAM (SBSRAM) and 8 MB of 100 MHz SDRAM in external (off-chip) memory banks. Aside from the fact that the C62x are fixed-point DSPs, another difference between the C67x and C62x is that the C67x is capable of taking advantage of double word-wide accesses via the LDDW instruction (see B.l.l), which is not available on the C62x.
The EVM boards also feature numerous interfaces for possible expansion, such as extra memory or peripheral daughter cards. However, as of late 2004 TI has discontinued the C6701 EVM product. This strategy is not because the C67x line is a dead-end, in contrast the latest generation of the C6x floating-point family is the C6713 that now reaches clock speeds of 300 MHz. Instead, TI is pushing their "DSP starter kit," or DSK (see Figure 2-4), as the avenue of choice for those wanting a cheap entry-level DSP development platform. The DSK is an external board that sits outside the host PC, and is expandable via daughter cards conforming to the TI TMS320 cross-platform daughtercard specification. This topic features the TMS320C6416 DSK15, which contains a 600 MHz C6416 DSP, with 1 MB of L2 internal memory and 16 MB of external SDRAM. One issue with the older DSKs was that they were somewhat slow, because their connectivity to the host PC was over a parallel port. This has ceased to be a problem because they now come in USB flavors, making them much faster -definitely a welcome development as single-stepping through a debugger over a parallel port on the older DSKs used to be painfully slow!
The EVM is useful in its own right because with it we can demonstrate usage of the host port interface (HPI), which is not available on the C6416 DSK but is used in other development boards and third-party products (of which there are many). However, it should be stressed that porting C6701 EVM code to another floating-point development platform such as the C6713 DSK is not a difficult task. Most of what will be needed involves changing the linker command file (see 3.2.2) to reflect the proper memory map and replacing some instances of the EVM support library with equivalent Chip Support Library (CSL) or DSP/BIOS API calls. Likewise, C6416 DSK to C67x DSK ports require modifications of the same magnitude.