Table 1 (a) Schedule generated by an early version of the TI-C compiler (12 cycles) [65] and (b) optimal schedule generated by OPTIMIST with dynamic programming (9 cycles) [58]
(a)
LDW.D1 *A4,B4
LDW.D1 *A1,A8
LDW.D1 *A3,A9
LDW.D1 *A0,B0
LDW.D1 *A2,B2
LDW.D1 *A5,B5
LDW.D1 *A7,A4
LDW.D1 *A6,B6
NOP 1
NOP 1 ; (last delay slot of LDW to B7)
MV.L2X A8,B1
MV.L2X A9,B3
MV.L2X A4,B7

(b)
LDW.D1 *A0,A8 || MV.L2X A1,B8
LDW.D2 *B8,B1 || LDW.D1 *A2,A9 || MV.L2X A3,B10
LDW.D2 *B10,B3 || LDW.D1 *A4,A10 || MV.L2X A5,B12
LDW.D2 *B12,B5 || LDW.D1 *A6,A11 || MV.L2X A7,B14
LDW.D2 *B14,B7
MV.L2X A8,B0
MV.L2X A9,B2
MV.L2X A10,B4
MV.L2X A11,B6
denotes an addition on the .S2 unit that accesses one A register via the cross path 2X. In total, there are 12 different instructions for addition (not counting the ADD2 option for 16-bit additions).
It becomes apparent that the problems of resource allocation (including cluster assignment) and instruction scheduling are not independent but should preferably be handled together to improve code quality. An example (adapted from Leupers [65]) is shown in Table 1: a basic block consisting of eight independent load (LDW) instructions is to be scheduled. The address operands are initially available (i.e., live on entry) in registers A0,...,A7 in register file A, and the results are expected to be written to registers B0,...,B7 (i.e., live on exit) in register file B. Load instructions execute on the cluster containing the address operand register; the result can be written to either register file. However, only one load or store instruction can access a register file per clock cycle to write its destination register or read its source register, respectively; otherwise, the processor stalls for one clock cycle to serialize the competing accesses. Copying a register between clusters (which occupies the corresponding cross path) can be done by a move (MV), which is shorthand for an ADD with one zero operand and has latency 1. As the processor has two load/store units and a load has latency 5, a lower bound for the makespan (the time until all results are available) is eight clock cycles; it can be sharpened to nine clock cycles if we consider that at least one of the addresses has to be moved early to register file B to enable parallel loading, which takes one more clock cycle. A naive solution (a) sequentializes the computation by executing all load instructions on cluster A only. A more advanced schedule (b) utilizes both load/store units in parallel by transporting four of the addresses to cluster B as early as possible, so that the loads can run in parallel. Note also that no implicit pipeline stalls occur, as the parallel load instructions always target different destination register files in their write-back phase, five clock cycles after issue time. Indeed, (b) is an optimal schedule; it was computed by the dynamic programming algorithm in OPTIMIST [58]. Generally,
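As an aside, the timing argument above can be re-checked with a small sketch. The following Python fragment is not part of the chapter; it simply encodes schedule (b) from Table 1 as (issue cycle, kind, destination) triples under the latency model implied by the text (load latency 5 with the issue cycle counted as the first latency cycle, MV latency 1, two load/store units) and recomputes the lower bounds, the 9-cycle makespan, and the absence of write-back conflicts between parallel loads.

from math import ceil
from collections import defaultdict

LOAD_LATENCY = 5     # result written at the end of the 5th cycle, counting the issue cycle
MV_LATENCY = 1
NUM_LOADS = 8
NUM_LOAD_UNITS = 2

# Naive lower bound: ceil(8/2) = 4 issue cycles, plus the remaining 4 latency
# cycles of the last load -> 8 cycles. Moving at least one address to register
# file B first costs one extra cycle -> sharpened bound of 9 cycles.
issue_cycles = ceil(NUM_LOADS / NUM_LOAD_UNITS)
lower_bound = issue_cycles + LOAD_LATENCY - 1
print("naive lower bound:", lower_bound, "sharpened:", lower_bound + 1)

# Schedule (b) from Table 1: (issue cycle, kind, destination register).
schedule_b = [
    (1, "LDW", "A8"), (1, "MV", "B8"),
    (2, "LDW", "B1"), (2, "LDW", "A9"), (2, "MV", "B10"),
    (3, "LDW", "B3"), (3, "LDW", "A10"), (3, "MV", "B12"),
    (4, "LDW", "B5"), (4, "LDW", "A11"), (4, "MV", "B14"),
    (5, "LDW", "B7"),
    (6, "MV", "B0"), (7, "MV", "B2"), (8, "MV", "B4"), (9, "MV", "B6"),
]

def finish(cycle, kind):
    # Cycle in which the instruction writes back its result.
    return cycle + (LOAD_LATENCY - 1 if kind == "LDW" else MV_LATENCY - 1)

makespan = max(finish(c, k) for c, k, _ in schedule_b)
print("makespan of schedule (b):", makespan)           # expected: 9

# Loads that write back in the same cycle must target different register files,
# otherwise the processor would stall; group load write-backs by cycle and check.
writes = defaultdict(list)
for c, k, dst in schedule_b:
    if k == "LDW":
        writes[finish(c, k)].append(dst[0])            # 'A' or 'B' register file
stall_free = all(len(files) == len(set(files)) for files in writes.values())
print("no two loads write the same file in one cycle:", stall_free)   # expected: True

Running the sketch reproduces the bounds of 8 and 9 cycles, a makespan of 9 cycles for schedule (b), and confirms that parallel loads never compete for the same register file in their write-back cycle.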
 
 