Digital Signal Processing Reference
In-Depth Information
a
b
c
A(1);
FOR i FROM 1 TO N-1 DO
B(i);
C(i) || A(i+1);
END DO
B(N);
C(N);
FORiFROM1TONDO
A(i);
B(i);
C(i);
END DO
A
+1
B
C
Fig. 11
Simple example: ( a ) Original loop, where A
(
i
)
, B
(
i
)
, C
(
i
)
denote operations in the loop
body that may compete for common resources, in this example B
(
i
)
and C
(
i
)
, and that may involve
both loop-independent data dependences, here A
(
i
)
B
(
i
)
and A
(
i
)
C
(
i
)
, and loop-carried data
dependences, here B
(
i
)
A
(
i
+
1
)
, see the dependence graph in ( b ). ( c ) After software pipelining
Software pipelining has been researched intensively, both as a high-level loop
transformation (performed in the middle end of a compiler or even as source-
to-source program transformation) and as low-level optimization late in the code
generation process (performed in the back end of a compiler), after instruction
selection with resource allocation has been performed. The former approaches are
independent of particular instructions and functional units to be selected for all
operations in the loop, and thus have to rely on inaccurate cost estimations for
execution time, energy, or register pressure when comparing various alternatives,
while the actual cost will also depend on decisions made late in the code generation
process. The latter approaches are bound to fixed instructions and functional units,
and hence the flexibility of implementing the same abstract operation by a variety
of different target machine instructions, with different resource requirements and
latency behavior, is lost. In either case, optimization opportunities are missed
because interdependent problems are solved separately in different compiler phases.
Approaches to integrate software pipelining with other code generation tasks will be
discussed in Sect. 8 .
Software pipelining , also called cyclic scheduling , transforms a loop such as that
in Fig. 11 a into an equivalent loop whose body contains operations from different
iterations of the original loop, such as that in Fig. 11 c , which may result in faster
code on an instruction-level parallel architecture as e.g. C
could
be executed in parallel ( || ) because they are statically known to be independent
of each other and not to subscribe to the same hardware resources. This parallel
execution was not possible in the original version of the loop because the code
generator usually treats the loop body (a basic block) as a unit for scheduling and
resource allocation, and furthermore separates the code for C
(
i
)
and A
(
i
+
1
)
by
a backward branch to the loop entry. The body of the transformed loop is called
the kernel , the operations before the kernel that “fill” the pipeline (here A
(
i
)
and A
(
i
+
1
)
)are
called the prologue , and the operations after the kernel that “drain” the pipeline (here
B
(
1
)
), are called the epilogue of the software-pipelined loop. Software
pipelining thus overlaps in the new kernel the execution of operations originating
from different iterations of the original loop, as far as permitted by given dependence
and resource constraints, in order to solicit more opportunities for parallel execution
(
N
)
and C
(
N
)
 
Search WWH ::




Custom Search