Application Specific Instruction Set DSP Processors - Signal Processing Systems

Digital Signal Processing Reference


Table 4 Examples of kernel DSP algorithms running on a single-MAC DSP

Algorithm kernel      Description or specification                              Typical cycle cost
--------------------  --------------------------------------------------------  ------------------
Block transfer        Move N data words from one memory to another              ∼3N
256p FFT              256-point FFT including computing and data access         ∼11,000
Single FIR            An N-tap FIR filter running one sample                    ∼N + 12
Frame FIR             An N-tap FIR filter running K samples                     ∼K(N + 6) + 8
Complex data FIR      An N-tap complex data FIR filter running one sample       ∼8N + 15
LMS Adaptive FIR      An N-tap least mean square (LMS) adaptive filter          ∼3N + 10
16/16 bits division   A positive 16-bit value divided by a positive 16-bit      ∼50
                      value
Vector add            C[i] = A[i] + B[i], for i from 0 to N-1                   ∼3N
Vector window         C[i] = A[i] * B[i], for i from 0 to N-1                   ∼3N
Vector Max            Find MAX A[i], for i from 0 to N-1                        ∼2N
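The cycle-cost formulas in Table 4 can be evaluated directly to get a rough budget for a given filter length and frame size. The sketch below is only an illustration: the function names are hypothetical, and the "MAC utilization" figure is an added metric that assumes a single MAC unit completes one multiply-accumulate per cycle, so that an N-tap FIR over K samples needs at least N*K cycles.

```python
# Typical cycle costs taken from Table 4 (single-MAC DSP).
# Function names and the utilization metric are illustrative assumptions,
# not part of the original table.

def single_fir_cycles(n_taps):
    """N-tap FIR, one sample: about N + 12 cycles."""
    return n_taps + 12


def frame_fir_cycles(n_taps, k_samples):
    """N-tap FIR, K samples in one call: about K(N + 6) + 8 cycles."""
    return k_samples * (n_taps + 6) + 8


def complex_fir_cycles(n_taps):
    """N-tap complex data FIR, one sample: about 8N + 15 cycles."""
    return 8 * n_taps + 15


def lms_fir_cycles(n_taps):
    """N-tap LMS adaptive FIR: about 3N + 10 cycles."""
    return 3 * n_taps + 10


def frame_fir_mac_utilization(n_taps, k_samples):
    """Share of frame-FIR cycles spent on the N*K multiply-accumulates,
    assuming one MAC per cycle (an assumption of this sketch)."""
    return (n_taps * k_samples) / frame_fir_cycles(n_taps, k_samples)


if __name__ == "__main__":
    n, k = 64, 160
    per_sample_calls = k * single_fir_cycles(n)   # K separate single-FIR calls
    one_frame_call = frame_fir_cycles(n, k)       # one frame-FIR call
    print("K single-sample calls:", per_sample_calls)
    print("one frame call:       ", one_frame_call)
    print("cycles saved:         ", per_sample_calls - one_frame_call)
    print("MAC utilization:      ", frame_fir_mac_utilization(n, k))
```

Under these formulas, processing K samples as one frame saves 6K - 8 cycles compared with K single-sample calls, because the fixed overhead is paid once per frame rather than once per sample.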

Benchmarks of basic DSP algorithms are usually written in assembly language. However, if the firmware design time (time to market) is very short, benchmarks written in a high-level language become necessary. In that case, the benchmark measures the combined quality of the instruction set and the compiler. BDTI (Berkeley Design Technology, Inc. [6]) supplies benchmarks based on handwritten assembly code. EEMBC (EDN Embedded Microprocessor Benchmark Consortium [7]) allows two scoring methods: Out-of-the-box benchmarking and Full-Fury benchmarking. Out-of-the-box benchmarking (requiring no extra effort) is based on the assembly code generated directly by the compiler. Full-Fury (also called optimized) benchmarking is based on assembly code that is generated and then fine-tuned by experienced programmers.

It is not easy to make a perfectly fair comparison by benchmarking low-level algorithm kernels on target processors. Each processor has dedicated features and is optimized for some algorithms but not for others. A processor with a poor benchmark record for one application might have a very good record for another. For example, a radio baseband processor will never be used as a video decoder processor. For a fair comparison, processors from different categories should not be compared.

DSP kernel algorithms typically make up about 10% of an application's code but account for about 90% of its runtime. Benchmarking on kernels is relevant because DSP kernel algorithms take the majority of the execution time in most DSP applications. Well-accepted DSP kernels are listed in Table 4. In the table, the typical cycle cost is measured on a DSP processor with a single MAC unit and two separate memory blocks, which is a simple and typical DSP processor. It reflects the average performance of commercial single-MAC DSP processors. If the benchmarking result of an ASIP you have designed falls well behind the scores in the table, you may need to understand why and try to improve your design.
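To see why kernel benchmarks matter so much when kernels dominate the runtime, the effect of a faster kernel on the whole application can be estimated with Amdahl's law. This is a minimal sketch and not part of the original text: the function name is hypothetical, and the 0.9 runtime fraction is simply the rule-of-thumb figure quoted above.

```python
def overall_speedup(kernel_runtime_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only the kernel
    portion of the runtime is accelerated by `kernel_speedup`."""
    if not 0.0 <= kernel_runtime_fraction <= 1.0:
        raise ValueError("kernel_runtime_fraction must be in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    non_kernel = 1.0 - kernel_runtime_fraction
    return 1.0 / (non_kernel + kernel_runtime_fraction / kernel_speedup)


if __name__ == "__main__":
    # Kernels assumed to take 90% of the runtime (rule of thumb from the text).
    for s in (2, 4, 10):
        print(f"kernel {s}x faster -> application "
              f"{overall_speedup(0.9, s):.2f}x faster")
```

With a 90% kernel share, the application speedup can never exceed 10x however fast the kernels become, since the remaining 10% of the runtime is untouched.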

The cycle cost and code cost of a kernel benchmarking subroutine consist of three parts: the prolog, the kernel, and the epilog. The prolog is the code that prepares and starts the kernel algorithm; it includes loading and configuring the parameters of the algorithm. The kernel is the code of the algorithm,
