Mapping Decidable Signal Processing Graphs into FPGA Implementations - Signal Processing Systems


Table 7 Synthesis results for the lattice filter using a Xilinx Virtex®-5 XC5VSX50T FPGA

Circuit                         DSP blocks   LUTs   Clock (MHz)   Throughput (MHz)
Original (Fig. 8)               4            109    88
Scaled and retimed (Fig. 9d)    4            -      266           53.2
Hardware shared (Fig. 10)       1            50     240           48

the cut will result in a delay of D on edge A1-M1, so to preempt this we transfer 3D from the output in cut-set #5 and transfer the delays as shown, giving Fig. 9d. The presence of a single delay on each output edge means that the circuit can now be pipelined at the processor level.

Synthesis results for the original function and the pipelined version, using the Xilinx Virtex®-5 FPGA technology, are given in Table 7. The added pipelining allows a neat mapping to the DSP48E processors, so that the blocks are fully used and the number of LUTs needed is reduced. Of course, the throughput is reduced, as inputs are only required once every five clock cycles, although the clock rate has increased considerably; this will be addressed later.
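The relationship between clock rate and throughput in Table 7 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes that throughput is simply the clock rate divided by the number of clock cycles between input samples.

```python
def throughput_mhz(clock_mhz: float, cycles_per_sample: int) -> float:
    """Sample rate when one input is accepted every `cycles_per_sample` clocks."""
    return clock_mhz / cycles_per_sample


# Clock and throughput figures (MHz) for the two pipelined rows of Table 7.
TABLE_7 = {
    "Scaled and retimed (Fig. 9d)": (266.0, 53.2),
    "Hardware shared (Fig. 10)": (240.0, 48.0),
}

for circuit, (clock, reported) in TABLE_7.items():
    # Recover the cycles-per-sample ratio implied by the reported figures.
    ratio = round(clock / reported)
    computed = throughput_mhz(clock, ratio)
    print(f"{circuit}: {ratio} cycles per sample, "
          f"{computed:.1f} MHz computed vs {reported} MHz reported")
```

For both rows the reported figures imply one input sample every five clock cycles.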

4 Circuit Architecture Optimization

The techniques so far have used pipelining to achieve the required sampling rates, but some applications operate at lower sampling rates than those indicated in Tables 6 and 7. In addition, pipelining may have introduced redundancy, as illustrated by the lattice filter example, which runs at a high clock rate but at a lower sampling rate than the original design. In these scenarios, the aim would be to share the hardware by folding.
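To make the redundancy concrete, the following sketch estimates how busy each DSP block is and shows a naive slot assignment on one shared unit. It assumes one multiply-accumulate operation per DSP block per sample and one input sample every five clock cycles (the ratio of clock rate to throughput in Table 7). The operation names and the round-robin assignment are hypothetical; a real folding schedule must also respect data dependencies and pipeline latencies.

```python
def utilization(ops_per_sample: int, units: int, cycles_per_sample: int) -> float:
    """Fraction of available unit-cycles that perform useful work."""
    return ops_per_sample / (units * cycles_per_sample)


def fold_round_robin(ops: list[str], cycles_per_sample: int) -> dict[int, str]:
    """Assign each operation its own time slot on a single shared unit.

    Assumption: dependencies and latencies are ignored; this only checks
    that the operations fit within one sample period.
    """
    if len(ops) > cycles_per_sample:
        raise ValueError("more operations than time slots in one sample period")
    return {slot: op for slot, op in enumerate(ops)}


OPS = ["op0", "op1", "op2", "op3"]  # hypothetical names, one per DSP block
CYCLES_PER_SAMPLE = 5               # assumed from the Table 7 clock/throughput ratio

print(f"4 units: {utilization(len(OPS), 4, CYCLES_PER_SAMPLE):.0%} busy")
print(f"1 unit:  {utilization(len(OPS), 1, CYCLES_PER_SAMPLE):.0%} busy")
print(fold_round_robin(OPS, CYCLES_PER_SAMPLE))
```

Under these assumptions, four dedicated units are each idle for most of the sample period, whereas a single shared unit can host all four operations in distinct slots with one slot to spare.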

4.1 Folding

In folding, the aim is to perform hardware sharing on the DFG, thereby allowing trade-offs to be made at a reasonably high level: it is relatively straightforward to work out the clock rate for the lattice filter after pipelining (by computing the delay for a pipelined multiplication and accumulation) without the need to perform FPGA synthesis. Given that a scaling factor of 5 has been applied to the lattice filter, it is clear that we can schedule each multiplication and addition operation to take place at different cycles, so it is possible to share hardware between these operations using one multiplier and adder without losing performance. This is achieved by folding [11], which was described in detail in
