of input and output pins that can be accessed in parallel, this implementation is
not suitable for supporting multiple PEs. In addition, the resampling process is
inherently memory-centric. Hence, a typical DSP suffers from extensive memory
accesses, which seriously degrade the throughput of particle filtering. Moreover,
standard addressing schemes on standard buses are not suitable for handling
non-deterministic data exchanges among the processing elements. On the other hand,
commercial FPGAs are viable since they provide enough I/O pins to support
concurrent data exchanges with the processing elements [16, 22]. Moreover, FPGAs
offer fast logic elements, flexible interconnects, and on-chip memory. However, for
high-throughput, low-complexity designs that support non-deterministic data
exchanges among the processing elements, we consider VLSI implementations.

Here, we present a VLSI design and implementation of a flexible resampling
mechanism. The architecture supports configurations with 2 or 4 PEs. With 4 PEs,
three different subconfigurations are supported, which differ in their performance
and throughput tradeoff. The architecture is designed for tracking applications [23]
but can be modified to support other particle filtering applications because the
resampling process is identical; the main differences would be in the number of
input and output pins and the size of the buffers. Static dual-ported SRAM is
incorporated to maintain high throughput.
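
For readers unfamiliar with the operation the PEs carry out, the sketch below shows
systematic resampling in behavioral Python. This is an illustration only: the chapter
does not state here which resampling algorithm the architecture implements, and the
function name and the assumption of normalized floating-point weights are ours, not
part of the VLSI design.

import random

def systematic_resample(weights):
    """Return a replication factor for each particle, given normalized weights."""
    M = len(weights)
    u = random.uniform(0.0, 1.0 / M)        # single random offset
    counts = [0] * M
    cumulative = 0.0                        # running sum of weights[0..i-1]
    i = 0
    for k in range(M):
        threshold = u + k / M               # M evenly spaced thresholds in [0, 1)
        # advance to the particle whose cumulative weight covers the threshold
        while i < M - 1 and cumulative + weights[i] < threshold:
            cumulative += weights[i]
            i += 1
        counts[i] += 1
    return counts

# Example: heavier particles receive larger replication factors.
print(systematic_resample([0.1, 0.2, 0.3, 0.4]))   # e.g. [0, 1, 1, 2]

By construction, the replication factors sum to M, so the resampled set has the same
size as the input set.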

In this chapter, we also consider the fixed-point processing issue for multiple PEs.
An efficient fixed-point processing mechanism for a single-PE particle filter has
been discussed previously [24]. It has also been shown that the execution time of a
fully pipelined particle filter, including resampling, is 2M T_PE, where M is the
total number of particles dedicated to the resampling and T_PE is the execution clock
period. Operations in particle filters other than resampling exhibit concurrency that
can be exploited through parallelization. However, resampling requires sequential
processing, which negates the benefit of parallel processing. This is because
resampling has to consider all M particles for their correct replication.
For simple parallel processing with P PEs, the execution time for M particles can be
represented as (⌈M/P⌉ + M) T_PE, where ⌈M/P⌉ T_PE is the time for concurrent parallel
processing of filtering operations other than resampling, and M T_PE is the time
required by resampling [20]. Thus, the overall execution time is lower bounded by
M T_PE, even with an infinite number of PEs. On the other hand, resampling can be
done locally within each PE in parallel, where the PEs resample their own M/P
particles. In this case, the execution time can be reduced to ⌈2M/P⌉ T_PE. However,
such parallel processing has a serious limitation. Particles
will be highly localized within each PE (i.e., bad particles will stay in the same PE
if there are not enough replicated particles, or some of the good particles will be
discarded if there are too many replicated particles in the PE). Thus, serious weight
degeneracy may occur. For example, two particles in two different PEs may have the
same weight, but their replication factors, which indicate the number of times a
particle should be replicated based on the decimal equivalent values of the weights,
may differ significantly.
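
To make this concrete, the sketch below uses hypothetical weights (M = 8 particles,
P = 2 PEs) and a simplified proportional-rounding stand-in for a real resampler; none
of the numbers or the helper name come from the chapter. A particle of weight 8.0
receives a replication factor of 3 in one PE and only 1 in the other when each PE
resamples its own M/P particles, whereas global resampling treats the two identically.

def replication_factors(weights, n_replicas):
    """Simplified proportional replication: round n_replicas * w / sum(w)."""
    total = sum(weights)
    return [round(n_replicas * w / total) for w in weights]

M, P = 8, 2                      # 8 particles, 2 PEs, M/P = 4 particles per PE
pe1 = [8.0, 1.0, 1.0, 1.0]       # one good particle surrounded by bad ones
pe2 = [8.0, 8.0, 8.0, 8.0]       # all particles equally good

# Local resampling: each PE considers only its own 4 particles.
print(replication_factors(pe1, M // P))   # -> [3, 0, 0, 0]
print(replication_factors(pe2, M // P))   # -> [1, 1, 1, 1]

# Global resampling over all 8 particles: equal weights get equal factors.
# (A real resampler would also fix up rounding so that the factors sum to M.)
print(replication_factors(pe1 + pe2, M))  # -> [1, 0, 0, 0, 1, 1, 1, 1]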
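
As a numerical illustration of the execution-time expressions above, the short sketch
below uses assumed values M = 1000 particles, P = 4 PEs, and T_PE = 10 ns; these
numbers are not taken from the chapter.

import math

M, P, T_PE = 1000, 4, 10e-9

t_single  = 2 * M * T_PE                     # fully pipelined single PE
t_central = (math.ceil(M / P) + M) * T_PE    # P PEs, centralized resampling
t_local   = math.ceil(2 * M / P) * T_PE      # P PEs, fully local resampling
lower_bnd = M * T_PE                         # resampling bound with central resampling

print(f"single PE:               {t_single * 1e6:.1f} us")   # 20.0 us
print(f"P PEs, central resample: {t_central * 1e6:.1f} us")  # 12.5 us
print(f"resampling lower bound:  {lower_bnd * 1e6:.1f} us")  # 10.0 us
print(f"P PEs, local resample:   {t_local * 1e6:.1f} us")    #  5.0 us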