Digital Signal Processing Reference
Table 4 Examples of kernel DSP algorithms running on a single MAC DSP

Algorithm kernel     Descriptions or specifications                              Typical cycle cost
-------------------  ----------------------------------------------------------  ------------------
Block transfer       Move N data words from one memory to another                3N + 3
256p FFT             256-point FFT including computing and data access           11,000
Single FIR           An N-tap FIR filter running one sample                      N + 12
Frame FIR            An N-tap FIR filter running K samples                       K(N + 6) + 8
Complex data FIR     An N-tap complex data FIR filter running one sample         8N + 15
LMS adaptive FIR     An N-tap least mean square (LMS) adaptive filter            3N + 10
16/16 bits division  A positive 16-bit value divided by a positive 16-bit value  50
Vector add           C[i] = A[i] + B[i], where i runs from 0 to N-1              3 + 3N
Vector window        C[i] = A[i] * B[i], where i runs from 0 to N-1              3 + 3N
Vector max           R = MAX(A[i]), where i runs from 0 to N-1                   2 + 2N

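The cycle-cost formulas in Table 4 are easy to evaluate for concrete filter lengths and frame sizes. The following is a minimal sketch in Python, assuming the formulas as listed in the table; the function names and the example values of N and K are illustrative and do not come from the source.

```python
# Cycle-cost formulas transcribed from Table 4 (single-MAC DSP).
# Function names are hypothetical, chosen only for this illustration.

def single_fir_cycles(n_taps: int) -> int:
    """N-tap FIR filter running one sample: N + 12."""
    return n_taps + 12

def frame_fir_cycles(n_taps: int, k_samples: int) -> int:
    """N-tap FIR filter running K samples: K(N + 6) + 8."""
    return k_samples * (n_taps + 6) + 8

def complex_fir_cycles(n_taps: int) -> int:
    """N-tap complex data FIR filter running one sample: 8N + 15."""
    return 8 * n_taps + 15

def lms_fir_cycles(n_taps: int) -> int:
    """N-tap LMS adaptive FIR filter: 3N + 10."""
    return 3 * n_taps + 10

def vector_add_cycles(n: int) -> int:
    """C[i] = A[i] + B[i] for i in 0..N-1: 3 + 3N."""
    return 3 + 3 * n

def vector_max_cycles(n: int) -> int:
    """R = MAX(A[i]) for i in 0..N-1: 2 + 2N."""
    return 2 + 2 * n

if __name__ == "__main__":
    n, k = 64, 160  # example sizes, assumed for illustration only
    repeated_single = k * single_fir_cycles(n)  # K separate one-sample calls
    framed = frame_fir_cycles(n, k)             # one call on a K-sample frame
    print("K single-sample calls:", repeated_single)
    print("one frame call:       ", framed)
    print("cycles saved:         ", repeated_single - framed)
```

With N = 64 and K = 160, calling the single-sample routine 160 times costs 160 × 76 = 12,160 cycles, while the frame routine costs 160 × 70 + 8 = 11,208 cycles. The difference comes from the smaller per-sample overhead in the frame formula.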
Benchmarks of basic DSP algorithms are usually written in assembly language.
However, if the firmware design time (time to market) is very short, benchmarks
written in a high-level language become necessary. In that case, the benchmark
measures the combined quality of the instruction set and the compiler.
BDTI (Berkeley Design Technology, Inc. [6]) supplies benchmarks based on
handwritten assembly code. EEMBC (EDN Embedded Microprocessor Benchmark
Consortium [7]) allows two scoring methods: Out-of-the-box benchmarking and
Full-Fury benchmarking. Out-of-the-box benchmarking (requiring no extra effort)
is based on the assembly code generated directly by the compiler. Full-Fury
(also called optimized) benchmarking is based on assembly code that is generated
and then fine-tuned by experienced programmers.
It is not easy to make a perfectly fair comparison by benchmarking low-level
algorithm kernels on target processors. Each processor has dedicated features and
is optimized for some algorithms but not for others. A processor with a poor
benchmark record for one application might have a very good record for another.
A typical case: a radio baseband processor will never be used as a video decoder
processor. For a fair comparison, processors from different categories should not
be compared.
DSP kernel algorithms typically make up about 10% of an application's code but
take about 90% of the runtime. Benchmarking on kernels is therefore relevant,
because DSP kernel algorithms take the majority of the execution time in most DSP
applications. Well-accepted DSP kernels are listed in Table 4. In the table, the
typical cycle cost is measured on a simple and typical DSP processor with a single
MAC unit and two separate memory blocks. The figures indicate the average
performance among single-MAC commercial DSP processors. If the benchmark result
of an ASIP you have designed falls far behind the scores in the table, you may
need to understand why and try to improve your design.
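One way to act on this advice is to express a measured cycle count as a ratio against the corresponding Table 4 formula. The sketch below assumes the Single FIR formula (N + 12) from the table; the function names, the 1.5 threshold for "far behind", and the measured cycle counts are hypothetical and chosen only for illustration.

```python
# Compare a measured kernel cycle count against a Table 4 reference value.
# The threshold and the measured numbers below are assumptions, not source data.

def reference_single_fir_cycles(n_taps: int) -> int:
    """Table 4 reference cost of an N-tap FIR running one sample: N + 12."""
    return n_taps + 12

def cost_ratio(measured_cycles: int, reference_cycles: int) -> float:
    """Return measured / reference; values above 1.0 mean slower than the table."""
    if reference_cycles <= 0:
        raise ValueError("reference_cycles must be positive")
    return measured_cycles / reference_cycles

def needs_investigation(measured_cycles: int, reference_cycles: int,
                        threshold: float = 1.5) -> bool:
    """Flag a result that exceeds the reference by more than the chosen threshold."""
    return cost_ratio(measured_cycles, reference_cycles) > threshold

if __name__ == "__main__":
    reference = reference_single_fir_cycles(64)  # 64 + 12 = 76 cycles
    for measured in (80, 150):                   # hypothetical ASIP measurements
        ratio = cost_ratio(measured, reference)
        flag = needs_investigation(measured, reference)
        print(f"measured={measured} ratio={ratio:.2f} investigate={flag}")
```

A ratio close to 1.0 suggests the design is in line with typical single-MAC processors for that kernel, while a large ratio points to the kernel loop, memory access, or loop overhead as places to look first.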
When coding and running a kernel benchmarking subroutine, the cycle cost and the
code cost each consist of three parts: the prolog, the kernel, and the epilog.
The prolog is the code that prepares and starts the kernel algorithm; it includes
loading and configuring the parameters of the algorithm. The kernel is the code
of the algorithm,
 
 