Digital Signal Processing Reference
of input and output pins that can be accessed in parallel, this implementation is
not suitable for supporting multiple PEs. In addition, the resampling process is
inherently memory-centric. Hence, a typical DSP suffers from extensive memory
accesses, which seriously degrade the throughput of particle filtering. Moreover,
standard addressing schemes on standard buses are not suitable for handling
non-deterministic data exchanges among the processing elements. On the other hand,
commercial FPGAs are viable since they provide enough I/O pins for supporting
multiple PEs and have fast logic elements, flexible interconnects, and memory.
However, for high-throughput designs with low complexity that support
non-deterministic data exchanges among the processing elements, we consider
VLSI implementations.
Here, we present a VLSI design and implementation of a flexible resampling
mechanism. The architecture supports configurations with 2 or 4 PEs. With 4
PEs, three different subconfigurations are supported where the difference is in the
performance and throughput tradeoff. The architecture is designed for tracking
the resampling process is identical. The main difference will be in the number
of input and output pins, and the size of buffers. Static dual-ported SRAM is
incorporated to maintain high throughput.
In this chapter, we also consider the fixed-point processing issue for multiple
PEs. An efficient mechanism for single PE in fixed-point processing of a particle
time of a fully pipelined particle filter, including resampling, is $2MT_{PE}$,
where $M$ is the total number of particles dedicated for the resampling and
$T_{PE}$ is the execution clock period. The operations in particle filters other
than the resampling can be parallelized, so their concurrency can be exploited.
However, the resampling requires sequential processing, which negates the benefit
of parallel processing. This is because the resampling has to consider all the $M$ particles for
their correct replication. For simple parallel processing with $P$ PEs, the execution
time for $M$ particles can be represented as $[M/P + M]\,T_{PE}$, where $[M/P]\,T_{PE}$ is the
time for concurrent parallel processing of filtering operations other than resampling
and the remaining $MT_{PE}$ is the time for the sequential resampling. The execution
time is therefore lower bounded by $MT_{PE}$, even with an infinite number of PEs. On the
other hand, resampling can be done locally within each PE in parallel, where the PEs
resample their own $M/P$ particles. In this case, the execution time can be reduced
to $[2M/P]\,T_{PE}$. However, such parallel processing has a serious limitation. Particles
will be highly localized within each PE (i.e., bad particles will stay in the same PE
if it does not have enough replicated particles, or some of the good particles will be
discarded if the PE produces more replicated particles than it can keep). Thus, serious
weight degeneracy may occur. For example, two particles in two different PEs may have
the same weights, but their replication factors, which indicate the number of times
that one particle should be replicated based on the decimal equivalent values of the
weights, may differ significantly.
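The timing expressions and the replication-factor mismatch described above can be illustrated with a short Python sketch. This is not the chapter's implementation: the function names, the example weights, and the choice of systematic resampling with a fixed offset of 0.5 (used only to make the result deterministic) are assumptions made for illustration. The sketch also assumes that $P$ divides $M$.

```python
def execution_cycles(M, P):
    """Cycle counts (multiply by T_PE for time) for the three schemes."""
    assert M % P == 0, "this sketch assumes P divides M"
    return {
        "single_pe": 2 * M,                # 2 M T_PE
        "central_resampling": M // P + M,  # [M/P + M] T_PE
        "local_resampling": 2 * M // P,    # [2M/P] T_PE
    }


def replication_factors(weights, n_out, u0=0.5):
    """Systematic resampling with a fixed offset u0 (assumed for determinism).

    Returns how many copies of each particle are kept so that exactly
    n_out particles survive. The running cumulative sum is what makes
    the procedure depend on every weight in the set.
    """
    total = sum(weights)
    factors = []
    cum = 0.0
    j = 0  # index of the next sampling point (j + u0) / n_out
    for w in weights:
        cum += w / total
        count = 0
        while j < n_out and (j + u0) / n_out <= cum:
            count += 1
            j += 1
        factors.append(count)
    return factors


# Hypothetical weights: M = 8 particles split over P = 2 PEs.
pe0 = [0.20, 0.02, 0.02, 0.01]
pe1 = [0.20, 0.20, 0.20, 0.15]

cycles = execution_cycles(M=1024, P=4)

# Resampling over all M particles versus locally within each PE.
global_factors = replication_factors(pe0 + pe1, n_out=8)
local_pe0 = replication_factors(pe0, n_out=4)
local_pe1 = replication_factors(pe1, n_out=4)

print(cycles)          # single_pe: 2048, central: 1280, local: 512
print(global_factors)  # [2, 0, 0, 0, 2, 1, 2, 1]
print(local_pe0)       # [3, 1, 0, 0]
print(local_pe1)       # [1, 1, 1, 1]
```

In this example the first particle of each PE has the same weight (0.20). Resampling over all eight particles replicates both of them twice, whereas local resampling replicates the one in the low-weight PE three times and the one in the high-weight PE only once.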