of input and output pins that can be accessed in parallel, this implementation is
not suitable for supporting multiple PEs. In addition, the resampling process is
inherently memory-centric. Hence, a typical DSP suffers from extensive memory
accesses, which seriously degrade the throughput of particle filtering. Moreover,
standard addressing schemes on standard buses are not suitable for handling
non-deterministic data exchanges among the processing elements. On the other hand,
commercial FPGAs are viable since they provide enough I/O pins to support
concurrent data exchanges with the processing elements [16, 22]. Moreover, FPGAs
offer fast logic elements, flexible interconnects, and on-chip memory. However, for
high-throughput, low-complexity designs that support non-deterministic data
exchanges among the processing elements, we consider VLSI implementations.

Here, we present a VLSI design and implementation of a flexible resampling
mechanism. The architecture supports configurations with 2 or 4 PEs. With 4 PEs,
three different subconfigurations are supported, which differ in their performance
and throughput tradeoff. The architecture is designed for tracking applications [23]
but can be modified to support other particle filtering applications because the
resampling process is identical; the main differences would be in the number of
input and output pins and the size of the buffers. Static dual-ported SRAM is
incorporated to maintain high throughput.
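
For readers unfamiliar with the operation the PEs carry out, the sketch below shows
systematic resampling in behavioral Python. This is an illustration only: the chapter
does not state here which resampling algorithm the architecture implements, and the
function name and the assumption of normalized floating-point weights are ours, not
part of the VLSI design.

import random

def systematic_resample(weights):
    """Return a replication factor for each particle, given normalized weights."""
    M = len(weights)
    u = random.uniform(0.0, 1.0 / M)        # single random offset
    counts = [0] * M
    cumulative = 0.0                        # running sum of weights[0..i-1]
    i = 0
    for k in range(M):
        threshold = u + k / M               # M evenly spaced thresholds in [0, 1)
        # advance to the particle whose cumulative weight covers the threshold
        while i < M - 1 and cumulative + weights[i] < threshold:
            cumulative += weights[i]
            i += 1
        counts[i] += 1
    return counts

# Example: heavier particles receive larger replication factors.
print(systematic_resample([0.1, 0.2, 0.3, 0.4]))   # e.g. [0, 1, 1, 2]

By construction, the replication factors sum to M, so the resampled set has the same
size as the input set.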

In this chapter, we also consider the fixed-point processing issue for multiple PEs.
An efficient fixed-point processing mechanism for a single-PE particle filter has
been discussed previously [24]. It has also been shown that the execution time of a
fully pipelined particle filter, including resampling, is 2M T_PE, where M is the
total number of particles dedicated to the resampling and T_PE is the execution clock
period. Operations in particle filters other than resampling exhibit concurrency that
can be exploited through parallelization. However, resampling requires sequential
processing, which negates the benefit of parallel processing. This is because
resampling has to consider all M particles for their correct replication.
For simple parallel processing with P PEs, the execution time for M particles can be
represented as (⌈M/P⌉ + M) T_PE, where ⌈M/P⌉ T_PE is the time for concurrent parallel
processing of filtering operations other than resampling, and M T_PE is the time
required by resampling [20]. Thus, the overall execution time is lower bounded by
M T_PE, even with an infinite number of PEs. On the other hand, resampling can be
done locally within each PE in parallel, where the PEs resample their own M/P
particles. In this case, the execution time can be reduced to ⌈2M/P⌉ T_PE. However,
such parallel processing has a serious limitation. Particles
will be highly localized within each PE (i.e., bad particles will stay in the same PE
if there are not enough replicated particles, or some of the good particles will be
discarded if there are too many replicated particles in the PE). Thus, serious weight
degeneracy may occur. For example, two particles in two different PEs may have the
same weight, but their replication factors, which indicate the number of times a
particle should be replicated based on the decimal equivalent values of the weights,
may differ significantly.
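
To make this concrete, the sketch below uses hypothetical weights (M = 8 particles,
P = 2 PEs) and a simplified proportional-rounding stand-in for a real resampler; none
of the numbers or the helper name come from the chapter. A particle of weight 8.0
receives a replication factor of 3 in one PE and only 1 in the other when each PE
resamples its own M/P particles, whereas global resampling treats the two identically.

def replication_factors(weights, n_replicas):
    """Simplified proportional replication: round n_replicas * w / sum(w)."""
    total = sum(weights)
    return [round(n_replicas * w / total) for w in weights]

M, P = 8, 2                      # 8 particles, 2 PEs, M/P = 4 particles per PE
pe1 = [8.0, 1.0, 1.0, 1.0]       # one good particle surrounded by bad ones
pe2 = [8.0, 8.0, 8.0, 8.0]       # all particles equally good

# Local resampling: each PE considers only its own 4 particles.
print(replication_factors(pe1, M // P))   # -> [3, 0, 0, 0]
print(replication_factors(pe2, M // P))   # -> [1, 1, 1, 1]

# Global resampling over all 8 particles: equal weights get equal factors.
# (A real resampler would also fix up rounding so that the factors sum to M.)
print(replication_factors(pe1 + pe2, M))  # -> [1, 0, 0, 0, 1, 1, 1, 1]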
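
As a numerical illustration of the execution-time expressions above, the short sketch
below uses assumed values M = 1000 particles, P = 4 PEs, and T_PE = 10 ns; these
numbers are not taken from the chapter.

import math

M, P, T_PE = 1000, 4, 10e-9

t_single  = 2 * M * T_PE                     # fully pipelined single PE
t_central = (math.ceil(M / P) + M) * T_PE    # P PEs, centralized resampling
t_local   = math.ceil(2 * M / P) * T_PE      # P PEs, fully local resampling
lower_bnd = M * T_PE                         # resampling bound with central resampling

print(f"single PE:               {t_single * 1e6:.1f} us")   # 20.0 us
print(f"P PEs, central resample: {t_central * 1e6:.1f} us")  # 12.5 us
print(f"resampling lower bound:  {lower_bnd * 1e6:.1f} us")  # 10.0 us
print(f"P PEs, local resample:   {t_local * 1e6:.1f} us")    #  5.0 us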