Data level parallelism is another important criterion in determining the partitioning for a given application. Applications targeted at VLIW-like architectures, especially signal processing applications, exhibit large amounts of both instruction and data level parallelism [32], often enough to exceed the functional units available on a given architecture. FPGA fabrics and highly parallel ASIC implementations can exploit these computational bottlenecks in the input application by providing not only large numbers of functional units but also large amounts of local block data RAM, supporting levels of instruction and data parallelism far beyond what a typical VLIW signal processing architecture can afford in terms of register file real estate. Furthermore, depending on the instruction set architecture of the host processor or DSP, sub-word or multiword operations may not be feasible. Most modern DSP architectures have fairly robust instruction sets that support fine grained multiword SIMD acceleration to some extent; it is often challenging, however, to load data from memory into the register files of a programmable SIMD style processor efficiently enough to fully utilize the SIMD ISA.
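As a concrete illustration, the sketch below uses x86 SSE2 intrinsics as a stand-in for a DSP's multiword SIMD ISA; the Q15 dot product kernel, the function name, and the data layout are assumptions made for the example, not drawn from any particular architecture.

#include <emmintrin.h>  /* SSE2, used here as a stand-in for a DSP SIMD ISA */
#include <stdint.h>

/* Dot product of two 16-bit vectors, n a multiple of 8.
 * _mm_madd_epi16 performs eight 16x16->32 multiplies with pairwise adds:
 * the kind of fine grained multiword SIMD operation described above.    */
int32_t dot_q15(const int16_t *a, const int16_t *b, int n)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {
        /* Unaligned loads are legal, but on many cores an aligned load
         * (_mm_load_si128) is cheaper; if the data layout forces strided
         * or misaligned access, the SIMD units stall waiting on memory,
         * which is exactly the loading bottleneck noted in the text.    */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }
    /* Horizontal reduction of the four 32-bit partial sums. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
}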
The computational complexity of the application often bounds the programmable DSP core, creating a compute bottleneck in the system. Algorithms implemented in an FPGA are often computationally intensive, exploiting greater amounts of instruction and data level parallelism than the host processor can afford given its functional unit limitations and pipeline depth. By mapping computationally intense bottlenecks from a software implementation executing on the host processor to a hardware implementation in the FPGA, one can effectively alleviate pressure on the host processor and free cycles for additional computation or algorithms to execute in parallel.
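A minimal sketch of this offload pattern follows, assuming a hypothetical memory-mapped FPGA accelerator; the base address, register offsets, and function names below are invented for illustration, not taken from any real device.

#include <stdint.h>

/* Hypothetical memory-mapped accelerator interface; the base address
 * and register offsets are illustrative only.                         */
#define ACCEL_BASE   0x40000000u
#define REG_SRC      (*(volatile uint32_t *)(ACCEL_BASE + 0x00))
#define REG_DST      (*(volatile uint32_t *)(ACCEL_BASE + 0x04))
#define REG_LEN      (*(volatile uint32_t *)(ACCEL_BASE + 0x08))
#define REG_CTRL     (*(volatile uint32_t *)(ACCEL_BASE + 0x0C))
#define REG_STATUS   (*(volatile uint32_t *)(ACCEL_BASE + 0x10))
#define CTRL_START   0x1u
#define STATUS_DONE  0x1u

void offload_filter(uint32_t src_pa, uint32_t dst_pa, uint32_t len,
                    void (*background_work)(void))
{
    REG_SRC  = src_pa;           /* physical address of input block   */
    REG_DST  = dst_pa;           /* physical address of output block  */
    REG_LEN  = len;
    REG_CTRL = CTRL_START;       /* kick the FPGA kernel              */

    /* The host is now free: spend the reclaimed cycles on other
     * algorithms while the fabric grinds through the hot loop.       */
    while (!(REG_STATUS & STATUS_DONE))
        background_work();
}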
Task level parallelism in a portion of the application can play a role in the ideal partitioning as well. Quite often, embedded applications contain multiple tasks that can execute concurrently but have a limited amount of instruction or data level parallelism within each individual task [69]. Applications in the networking space, and baseband processing at layers above the data plane, typically involve processing packets and traversing packet headers, data descriptors, and multiple task queues. If a given task contains enough instruction and data level parallelism to exhaust the available host processor compute resources, it can be considered for partitioning to an accelerator. In many cases, several of these tasks can execute in parallel, either across multiple host processors or across both the host processor and the FPGA compute engine, depending on data access patterns and cross task data dependencies. A number of architectures accelerate control plane tasks, as opposed to data plane tasks, in hardware. One example is the Freescale Semiconductor QorIQ platform, which provides hardware acceleration for frame managers, queue managers, and buffer managers, effectively freeing the programmable processor cores from control plane management.
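A minimal sketch of host-side task level parallelism, assuming POSIX threads; the packet structure and the two tasks (header parsing and payload checksumming) are hypothetical stand-ins for the kind of packet-oriented work described above.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Two independent tasks run concurrently even though each has little
 * internal instruction or data level parallelism.                      */
typedef struct { uint8_t hdr[16]; uint8_t payload[64]; } packet_t;

static packet_t pkt;  /* one in-flight packet, for brevity */

static void *parse_headers(void *arg)    /* task 1: header traversal   */
{
    (void)arg;
    printf("proto byte: %u\n", pkt.hdr[9]);
    return NULL;
}

static void *checksum_payload(void *arg) /* task 2: payload processing */
{
    uint32_t sum = 0;
    (void)arg;
    for (unsigned i = 0; i < sizeof pkt.payload; i++)
        sum += pkt.payload[i];
    printf("checksum: %u\n", sum);
    return (void *)(uintptr_t)sum;
}

int main(void)
{
    pthread_t t1, t2;
    /* The two tasks touch disjoint fields of the packet, so no cross
     * task data dependency blocks concurrent execution.               */
    pthread_create(&t1, NULL, parse_headers, NULL);
    pthread_create(&t2, NULL, checksum_payload, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}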