Processor Cores - Heterogeneous Multicore Processor Technologies for Embedded Systems

Hardware Reference

In-Depth Information

Then the object is generated using the assembler and the linker. Since the FE-GA

is managed by CPUs, users need to create FE-GA control codes and attach them

to the reference program. Finally, the combined program for CPUs and FE-GA is

debugged on the integrated CPU and FE-GA simulator or on a real chip.

3.2.7

Implementation of Fast Fourier Transform on FE-GA

Fast Fourier transform (FFT), which is a common algorithm used in media process-

ing, was implemented on the FE-GA for evaluation. In this subsection, details of

the implementation are described. The algorithm used for mapping and evaluation

was a radix-2 decimation-in-time FFT, which is the most common form of the

Cooley-Tukey algorithm [ 53, 54 ]. We used this algorithm because the radix-2 FFT

is simple, and the decimation-in-time FFT has a multiplication of data and a twiddle

factor in the first part of calculation before fixed-point processing. It avoids having

to use wiring in order for supplying a twiddle factor into the middle of the cell array;

therefore, it preserves the resources of the cells for the fixed-point processing. The

format of input and output data is 16-bit fixed point (Q-15 format).

The FFT is calculated by repeating the butterfly calculation as follows (a decima-

tion-in-time algorithm):

a x yWb x yW

=+×

,

=−×

,

where a , b , x , and y are imaginary data and W is a twiddle factor. The equation can

be divided into a real part and an imaginary part as follows:

ar

=+× −×

xr

yr

Wr

yi

Wi ai

,

=+×+×

xi

yr

Wi

yi

Wr

,

br

=−× +×

xr

yr

Wr

yi

Wi bi

,

=−×−×

xi

yr

Wi

yi

Wr

.

Figure 3.57a shows a data flow graph of the above equations. The circled “×,”

“<<,” “+,” and “−,” respectively, indicate multiplication, 1-bit left shift, addition,

and subtraction. Because the data are 16-bit fixed point, the upper 16 bits of the

multiplied data with W should be left shifted. Figure 3.57b depicts a mapping of

the data flow graph on the 4 × 4 cells (the upper half part of the operation cell array).

A rectangle in each cell indicates an operation, and an arrow between rectangles

shows 16-bit data (dashed arrow is 1-bit data). In the rectangles, “D” inserts 1-cycle

delay. “ROTL” shifts 1 bit to the left and inserts an input 1-bit data into the LSB.

“MSB” outputs the MSB as 1-bit data, which is realized by an “add with carry”

operation. “~MSB” outputs the complement of the MSB as 1-bit data. Note that

the MLT cells, normally set in the second row of the cell array, are set in the first row

in order to map the FFT. After multiplications at the first row, the MSB of the lower

16-bit data is extracted, and the upper 16-bit datum with 1-cycle delay applied

is shifted 1 bit to the left with the MSB of the lower data attached to the LSB.

Heterogeneous Multicore Processor Technologies for Embedded Systems

Search WWH ::

Custom Search

Home