Table 1 (a) Schedule generated by an early version of the TI-C compiler (12 cycles) [65] and (b) optimal schedule generated by OPTIMIST with dynamic programming (9 cycles) [58]
(a)
LDW.D1 *A4,B4
LDW.D1 *A1,A8
LDW.D1 *A3,A9
LDW.D1 *A0,B0
LDW.D1 *A2,B2
LDW.D1 *A5,B5
LDW.D1 *A7,A4
LDW.D1 *A6,B6
NOP 1
NOP 1 ; (last delay slot of LDW to B7)
MV.L2X A8,B1
MV.L2X A9,B3
MV.L2X A4,B7

(b)
LDW.D1 *A0,A8 || MV.L2X A1,B8
LDW.D2 *B8,B1 || LDW.D1 *A2,A9 || MV.L2X A3,B10
LDW.D2 *B10,B3 || LDW.D1 *A4,A10 || MV.L2X A5,B12
LDW.D2 *B12,B5 || LDW.D1 *A6,A11 || MV.L2X A7,B14
LDW.D2 *B14,B7
MV.L2X A8,B0
MV.L2X A9,B2
MV.L2X A10,B4
MV.L2X A11,B6
denotes an addition on the .S2 unit that accesses one A register via the cross path 2X. In total, there are 12 different instructions for addition (not counting the ADD2 option for 16-bit additions).
It becomes apparent that the problems of resource allocation (including cluster assignment) and instruction scheduling are not independent but should preferably be handled together to improve code quality. An example (adapted from Leupers [65]) is shown in Table 1: a basic block consisting of eight independent load (LDW) instructions is to be scheduled. The address operands are initially available (i.e., live on entry) in registers A0,...,A7 in register file A, and the results are expected to be written to registers B0,...,B7 (i.e., live on exit) in register file B. Load instructions execute on the cluster containing the address operand register; the result can be written to either register file. However, only one load or store instruction can access a register file per clock cycle to write its destination register or read its source register, respectively; otherwise, the processor stalls for one clock cycle to serialize the competing accesses. Copying a register between clusters (which occupies the corresponding cross path) can be done by a move (MV), which is shorthand for an ADD with one zero operand and has latency 1. As the processor has two load/store units and a load has latency 5, a lower bound for the makespan (the time until all results are available) is eight clock cycles; it can be sharpened to nine clock cycles if we consider that at least one of the addresses has to be moved early to register file B to enable parallel loading, which takes one more clock cycle. A naive solution (a) sequentializes the computation by executing all load instructions on cluster A only. A more advanced schedule (b) utilizes both load/store units in parallel by transporting four of the addresses to cluster B as early as possible, so that the loads can run in parallel. Note also that no implicit pipeline stalls occur, as the parallel load instructions always target different destination register files in their write-back phase, five clock cycles after issue time. Indeed, (b) is an optimal schedule; it was computed by the dynamic programming algorithm in OPTIMIST [58]. Generally,
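As an aside, the timing argument above can be re-checked with a small sketch. The following Python fragment is not part of the chapter; it simply encodes schedule (b) from Table 1 as (issue cycle, kind, destination) triples under the latency model implied by the text (load latency 5 with the issue cycle counted as the first latency cycle, MV latency 1, two load/store units) and recomputes the lower bounds, the 9-cycle makespan, and the absence of write-back conflicts between parallel loads.

from math import ceil
from collections import defaultdict

LOAD_LATENCY = 5     # result written at the end of the 5th cycle, counting the issue cycle
MV_LATENCY = 1
NUM_LOADS = 8
NUM_LOAD_UNITS = 2

# Naive lower bound: ceil(8/2) = 4 issue cycles, plus the remaining 4 latency
# cycles of the last load -> 8 cycles. Moving at least one address to register
# file B first costs one extra cycle -> sharpened bound of 9 cycles.
issue_cycles = ceil(NUM_LOADS / NUM_LOAD_UNITS)
lower_bound = issue_cycles + LOAD_LATENCY - 1
print("naive lower bound:", lower_bound, "sharpened:", lower_bound + 1)

# Schedule (b) from Table 1: (issue cycle, kind, destination register).
schedule_b = [
    (1, "LDW", "A8"), (1, "MV", "B8"),
    (2, "LDW", "B1"), (2, "LDW", "A9"), (2, "MV", "B10"),
    (3, "LDW", "B3"), (3, "LDW", "A10"), (3, "MV", "B12"),
    (4, "LDW", "B5"), (4, "LDW", "A11"), (4, "MV", "B14"),
    (5, "LDW", "B7"),
    (6, "MV", "B0"), (7, "MV", "B2"), (8, "MV", "B4"), (9, "MV", "B6"),
]

def finish(cycle, kind):
    # Cycle in which the instruction writes back its result.
    return cycle + (LOAD_LATENCY - 1 if kind == "LDW" else MV_LATENCY - 1)

makespan = max(finish(c, k) for c, k, _ in schedule_b)
print("makespan of schedule (b):", makespan)           # expected: 9

# Loads that write back in the same cycle must target different register files,
# otherwise the processor would stall; group load write-backs by cycle and check.
writes = defaultdict(list)
for c, k, dst in schedule_b:
    if k == "LDW":
        writes[finish(c, k)].append(dst[0])            # 'A' or 'B' register file
stall_free = all(len(files) == len(set(files)) for files in writes.values())
print("no two loads write the same file in one cycle:", stall_free)   # expected: True

Running the sketch reproduces the bounds of 8 and 9 cycles, a makespan of 9 cycles for schedule (b), and confirms that parallel loads never compete for the same register file in their write-back cycle.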
 
 