wireless system, it is very important to develop area- and power-efficient 4G wireless
receivers. Given the area and power constraints of mobile handsets, one cannot
simply implement computation-intensive DSP algorithms on gigahertz DSPs. It is
also critical to reduce base station power consumption through optimized hardware
accelerator design.
In this second edition, we describe a few DSP algorithms that dominate the
computational complexity of a wireless receiver. These algorithms, including
Viterbi decoding, Turbo decoding, LDPC decoding, MIMO detection, and channel
equalization/FFT, need to be off-loaded to hardware coprocessors or accelerators
to achieve high performance. These hardware accelerators are often integrated on
the same die as the DSP processor. In addition, it is also possible to leverage field-
programmable gate arrays (FPGAs) to provide reconfigurable massive computation
capability, as described in another chapter of this handbook [40].
DSP workloads are typically numerically intensive, with large amounts of both
instruction- and data-level parallelism. In order to exploit this parallelism with a
programmable processor, most DSP architectures utilize Very Long Instruction
Word (VLIW) designs. VLIW architectures typically include multiple
register files on the processor die, rather than the single monolithic register file
often found in general-purpose computing. Examples of such architectures
include the Freescale StarCore processors, the Texas Instruments TMS320C6x series
DSPs, and the SHARC DSPs from Analog Devices, to name a few [3, 22, 63].
A comprehensive overview of general-purpose DSP processors is given in
another chapter of this handbook [58].
In some cases, due to the idiosyncratic nature of many DSPs and the imple-
mentation of some of the more powerful instructions in the DSP core, an optimizing
compiler cannot always target core functionality in an optimal manner. Examples
include high-performance fractional arithmetic instructions, which may bundle
SIMD-style functionality that the compiler cannot always prove safe to generate
at compile time.
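To make this concrete, the sketch below models in portable C the kind of saturating Q15 fractional multiply and multiply-accumulate that many DSP cores execute in a single instruction; the function names and structure are our own illustration, not taken from any of the ISAs cited above. The saturation branch is exactly the behavior a compiler cannot always prove is intended when it sees equivalent C written by hand.

```c
#include <stdint.h>

/* Saturating Q15 fractional multiply: a C model of the single-cycle
   fractional multiply instruction found on many DSP cores (names and
   structure are illustrative, not from a specific ISA). */
static int16_t q15_mul(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * (int32_t)b;   /* full 32-bit product */
    p >>= 15;                              /* rescale back to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;      /* saturate on overflow: only */
    if (p < INT16_MIN) p = INT16_MIN;      /* -1.0 * -1.0 can hit this   */
    return (int16_t)p;
}

/* Q15 multiply-accumulate into a wide 32-bit accumulator, as performed
   by a hardware MAC unit (rescaling deferred until readout). */
static int32_t q15_mac(int32_t acc, int16_t a, int16_t b) {
    return acc + (int32_t)a * (int32_t)b;
}
```

For example, `q15_mul(0x4000, 0x4000)` (0.5 × 0.5 in Q15) yields `0x2000` (0.25), while multiplying −1.0 by −1.0 saturates to the largest representable value rather than wrapping.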
While the aforementioned VLIW-based DSP architectures provide increased
parallelism and higher numerical throughput, this comes at the cost of
programmability. Typically, such machines depend on advanced
optimizing compilers capable of aggressively analyzing the instruction- and
data-level parallelism in the target workloads and mapping it onto the parallel
hardware. Due to the large number of parallel functional units and the deep
pipelines, modern DSPs are often difficult to hand-program at the assembly level
with optimal results. As such, one technique used by the optimizing compiler
is to vectorize much of the data-level parallelism often found in DSP workloads. In
doing so, the compiler can often fully exploit the single instruction, multiple data
(SIMD) functionality found in modern DSP instruction sets.
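As a small illustration of the kind of loop a vectorizing compiler can map onto SIMD lanes, consider the inner product below (a common DSP kernel, e.g. an FIR tap sum); the `restrict` qualifiers assert that the input arrays do not alias, which is often precisely the guarantee the compiler needs to prove the loop safe to vectorize. This is a generic sketch, not code for any particular DSP.

```c
#include <stddef.h>
#include <stdint.h>

/* Vectorization-friendly Q15 inner product: `restrict` promises the
   compiler that x and y do not overlap, so the loop iterations are
   independent and can be mapped onto SIMD multiply-accumulate lanes. */
int32_t dot_q15(const int16_t *restrict x,
                const int16_t *restrict y, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * y[i];   /* widened MAC per element */
    return acc;
}
```

Without `restrict` (or equivalent alias analysis), the compiler may conservatively assume that writes through one pointer could affect reads through the other and fall back to scalar code.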
Despite such highly parallel programmable processor cores and advanced com-
piler technology, however, it is quite often the case that the amount of available
instruction- and data-level parallelism in modern signal processing workloads far
exceeds the limited resources available in a VLIW-based programmable processor
core. For example, the implementation complexity for a 40 Kbps DS-CDMA system