Digital Signal Processing Reference
Table 7  Synthesis results for the lattice filter using a Xilinx Virtex®-5 XC5VSX50T FPGA

Circuit                        DSP blocks   LUTs   Clock (MHz)   Throughput (MHz)
Original (Fig. 8)              4            109    88            88
Scaled and retimed (Fig. 9d)   4            -      266           53.2
Hardware shared (Fig. 10)      1            50     240           48
the cut will result in a delay of D on edge A1-M1, so to preempt this we transfer
3D from the output in cut-set #5 and transfer the delays as shown, giving Fig. 9d.
The presence of a single delay on each output edge means that the circuit can now
be pipelined at the processor level.
Synthesis results for the original function and the pipelined version using the
Xilinx Virtex®-5 FPGA technology are given in Table 7. The addition of the pipeline
allows a neat mapping to the DSP48E processors, allowing the blocks to be fully
used and reducing the number of LUTs needed. Of course, the throughput is reduced,
as inputs are only required once every five clock cycles, although the clock rate has
increased considerably; this will be addressed later.
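The throughput column of Table 7 can be reproduced by dividing the clock rate by the number of clock cycles between accepted inputs. The following is a minimal sketch of that arithmetic, not part of the original design flow. The function name `throughput_mhz` is illustrative, and the cycles-per-input values (1 for the original circuit, 5 for the other two) are assumptions inferred from the ratio of the clock and throughput columns in the table.

```python
def throughput_mhz(clock_mhz: float, cycles_per_input: int) -> float:
    """Sample rate in MHz when one input is accepted every `cycles_per_input` clocks."""
    if cycles_per_input < 1:
        raise ValueError("cycles_per_input must be at least 1")
    return clock_mhz / cycles_per_input


# Clock rates are taken from Table 7. The cycles-per-input values are
# assumptions inferred from the clock/throughput ratio in that table.
designs = {
    "original": (88.0, 1),
    "scaled_and_retimed": (266.0, 5),
    "hardware_shared": (240.0, 5),
}

results = {name: throughput_mhz(clk, n) for name, (clk, n) in designs.items()}

for name, rate in results.items():
    print(f"{name}: {rate:.1f} MHz")
```

Under these assumptions the sketch makes the trade-off explicit: the retimed circuit clocks roughly three times faster than the original, yet its sample rate is lower because only one clock cycle in five accepts a new input.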
4 Circuit Architecture Optimization
The techniques so far have used pipelining to achieve the required
sampling rates, but some applications operate at lower sampling rates than those
indicated in Tables 6 and 7. In addition, pipelining may introduce redundancy,
as illustrated by the lattice filter example, which runs at a high clock
rate but at a comparatively lower sampling rate than the original design.
In these scenarios, the aim would be to share the hardware by folding.
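The redundancy described above can be put into rough numbers. The sketch below is a simplified model, not the folding procedure itself. It assumes that each operation occupies a hardware unit for one clock cycle per sample period, that a sample period spans `fold` clock cycles, and that data dependencies between operations can be ignored (a real folding schedule must respect them). The function names and the round-robin slot assignment are hypothetical; the figures of 4 operations and 5 cycles echo the lattice filter values in Table 7.

```python
import math


def units_needed(num_ops: int, fold: int) -> int:
    """Smallest number of shared units that can host num_ops in fold time slots."""
    return math.ceil(num_ops / fold)


def utilisation(num_ops: int, num_units: int, fold: int) -> float:
    """Fraction of available unit-cycles per sample period that do useful work."""
    return num_ops / (num_units * fold)


def assign_slots(num_ops: int, fold: int) -> list[tuple[int, int]]:
    """Round-robin (unit, time_slot) assignment; ignores data dependencies."""
    return [(op // fold, op % fold) for op in range(num_ops)]


NUM_OPS = 4   # assumed: one operation per DSP block in the unshared design
FOLD = 5      # assumed: clock cycles per input sample

before = utilisation(NUM_OPS, num_units=4, fold=FOLD)
shared_units = units_needed(NUM_OPS, FOLD)
after = utilisation(NUM_OPS, shared_units, FOLD)
schedule = assign_slots(NUM_OPS, FOLD)

print(f"unshared utilisation: {before:.0%}")
print(f"shared units: {shared_units}, utilisation: {after:.0%}")
print(f"schedule (unit, slot): {schedule}")
```

In this simplified model, four units that are each busy one cycle in five can be replaced by a single unit that is busy four cycles in five, which is the kind of saving that motivates folding.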
4.1 Folding
In folding, the aim is to perform hardware sharing on the DFG, thereby
allowing trade-offs to be made at a reasonably high level. It is relatively
straightforward to work out the clock rate for the lattice filter after pipelining
(by computing the delay for a pipelined multiplication and accumulation) without
the need to perform FPGA synthesis. Given that a scaling factor of 5 has been
applied to the lattice filter, it is clear that we can schedule each multiplication
and addition operation to take place at different cycles, so it is possible to share
hardware between these operations using one multiplier and adder without losing
performance. This is achieved by folding [11], which was described in detail in