Data level parallelism is another important criterion in determining the partitioning for a given application. Applications targeted at VLIW-like architectures, especially signal processing applications, exhibit large amounts of both instruction and data level parallelism [32], often enough to exceed the functional units available on a given architecture. FPGA fabrics and highly parallel ASIC implementations can exploit these computational bottlenecks in the input application by providing not only large numbers of functional units but also large amounts of local block data RAM, supporting levels of instruction and data parallelism far beyond what a typical VLIW signal processing architecture can afford in terms of register file real estate. Furthermore, depending on the instruction set architecture of the host processor or DSP, sub-word or multiword operations may not be feasible. Most modern DSP architectures have fairly robust instruction sets that support fine grained multiword SIMD acceleration to some extent; it is often challenging, however, to load data from memory into the register files of a programmable SIMD style processor efficiently enough to fully utilize the SIMD ISA.
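As a concrete illustration, the sketch below uses x86 SSE2 intrinsics as a stand-in for a DSP's multiword SIMD ISA; the Q15 dot product kernel, the function name, and the data layout are assumptions made for the example, not drawn from any particular architecture.

#include <emmintrin.h>  /* SSE2, used here as a stand-in for a DSP SIMD ISA */
#include <stdint.h>

/* Dot product of two 16-bit vectors, n a multiple of 8.
 * _mm_madd_epi16 performs eight 16x16->32 multiplies with pairwise adds:
 * the kind of fine grained multiword SIMD operation described above.    */
int32_t dot_q15(const int16_t *a, const int16_t *b, int n)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {
        /* Unaligned loads are legal, but on many cores an aligned load
         * (_mm_load_si128) is cheaper; if the data layout forces strided
         * or misaligned access, the SIMD units stall waiting on memory,
         * which is exactly the loading bottleneck noted in the text.    */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }
    /* Horizontal reduction of the four 32-bit partial sums. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
}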
The computational complexity of the application often bounds the programmable DSP core, creating a compute bottleneck in the system. Algorithms implemented in an FPGA are often computationally intensive, exploiting greater amounts of instruction and data level parallelism than the host processor can afford given its functional unit limitations and pipeline depth. By mapping computationally intense bottlenecks from a software implementation executing on the host processor to a hardware implementation in the FPGA, one can effectively alleviate pressure on the host processor and free cycles for additional computation or algorithms to execute in parallel.
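A minimal sketch of this offload pattern follows, assuming a hypothetical memory-mapped FPGA accelerator; the base address, register offsets, and function names below are invented for illustration, not taken from any real device.

#include <stdint.h>

/* Hypothetical memory-mapped accelerator interface; the base address
 * and register offsets are illustrative only.                         */
#define ACCEL_BASE   0x40000000u
#define REG_SRC      (*(volatile uint32_t *)(ACCEL_BASE + 0x00))
#define REG_DST      (*(volatile uint32_t *)(ACCEL_BASE + 0x04))
#define REG_LEN      (*(volatile uint32_t *)(ACCEL_BASE + 0x08))
#define REG_CTRL     (*(volatile uint32_t *)(ACCEL_BASE + 0x0C))
#define REG_STATUS   (*(volatile uint32_t *)(ACCEL_BASE + 0x10))
#define CTRL_START   0x1u
#define STATUS_DONE  0x1u

void offload_filter(uint32_t src_pa, uint32_t dst_pa, uint32_t len,
                    void (*background_work)(void))
{
    REG_SRC  = src_pa;           /* physical address of input block   */
    REG_DST  = dst_pa;           /* physical address of output block  */
    REG_LEN  = len;
    REG_CTRL = CTRL_START;       /* kick the FPGA kernel              */

    /* The host is now free: spend the reclaimed cycles on other
     * algorithms while the fabric grinds through the hot loop.       */
    while (!(REG_STATUS & STATUS_DONE))
        background_work();
}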
Task level parallelism in a portion of the application can play a role in the ideal partitioning as well. Quite often, embedded applications contain multiple tasks that can execute concurrently but have a limited amount of instruction or data level parallelism within each individual task [69]. Applications in the networking space, and baseband processing at layers above the data plane, typically involve processing packets and traversing packet headers, data descriptors, and multiple task queues. If a given task contains enough instruction and data level parallelism to exhaust the available host processor compute resources, it can be considered for partitioning to an accelerator. In many cases, several of these tasks can execute in parallel, either across multiple host processors or across both the host processor and the FPGA compute engine, depending on data access patterns and cross task data dependencies. A number of architectures accelerate control plane tasks, as opposed to data plane tasks, in hardware. One example is the Freescale Semiconductor QorIQ platform, which provides hardware acceleration for frame managers, queue managers, and buffer managers, effectively freeing the programmable processor cores from control plane management.
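A minimal sketch of host-side task level parallelism, assuming POSIX threads; the packet structure and the two tasks (header parsing and payload checksumming) are hypothetical stand-ins for the kind of packet-oriented work described above.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Two independent tasks run concurrently even though each has little
 * internal instruction or data level parallelism.                      */
typedef struct { uint8_t hdr[16]; uint8_t payload[64]; } packet_t;

static packet_t pkt;  /* one in-flight packet, for brevity */

static void *parse_headers(void *arg)    /* task 1: header traversal   */
{
    (void)arg;
    printf("proto byte: %u\n", pkt.hdr[9]);
    return NULL;
}

static void *checksum_payload(void *arg) /* task 2: payload processing */
{
    uint32_t sum = 0;
    (void)arg;
    for (unsigned i = 0; i < sizeof pkt.payload; i++)
        sum += pkt.payload[i];
    printf("checksum: %u\n", sum);
    return (void *)(uintptr_t)sum;
}

int main(void)
{
    pthread_t t1, t2;
    /* The two tasks touch disjoint fields of the packet, so no cross
     * task data dependency blocks concurrent execution.               */
    pthread_create(&t1, NULL, parse_headers, NULL);
    pthread_create(&t2, NULL, checksum_payload, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}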