Predication
Modulo scheduling techniques for CGRAs [20, 22, 25, 48, 54, 55] only schedule loops that are free of control flow transfers. Hence any loop body that contains conditional statements first needs to be if-converted into hyperblocks by means of predication [45]. For this reason, many CGRAs, including ADRES CGRAs, support predication.
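As a minimal illustration of if-conversion, the following C sketch contrasts a loop body containing a conditional with its predicated, branch-free equivalent. The function and array names are hypothetical and merely stand in for a real loop body.

    #include <stddef.h>

    /* Original loop: the if/else is a control flow transfer inside the body. */
    void scale_branchy(const int *in, int *out, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (in[i] > 0)
                out[i] = in[i] * 2;
            else
                out[i] = -in[i];
        }
    }

    /* If-converted form: both paths are computed and a predicate selects the
     * result, so the body becomes a single branch-free hyperblock that a
     * modulo scheduler can handle. */
    void scale_predicated(const int *in, int *out, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            int p = in[i] > 0;      /* predicate                  */
            int t =  in[i] * 2;     /* "then" path, guarded by p  */
            int e = -in[i];         /* "else" path, guarded by !p */
            out[i] = p ? t : e;     /* predicated select          */
        }
    }

On a CGRA with predicated execution, the select and the guarded operations map to hardware predicates rather than branches, which is what allows the whole body to be treated as one hyperblock.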
Hyperblock formation can result in very inefficient code if a loop body contains code paths that are executed rarely. All those paths contribute to ResMII and potentially to RecMII. Hence even paths that are executed very infrequently can slow down a whole modulo-scheduled loop. Such loops can be detected with profiling, and if the data dependencies allow it, it can be useful to split them into multiple loops. For example, a first loop can contain only the code of the frequently executed paths, with a lower II than the original loop. If, during the execution of this first loop, some iteration turns out to require the infrequently executed code, the first loop is exited, and the remaining iterations are handled by a second loop that includes both the frequently and the infrequently executed code paths.
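The following C sketch illustrates such a loop split under simple assumptions; the helpers needs_rare_path, frequent_work, and rare_work are hypothetical placeholders for the check and the two code paths.

    #include <stddef.h>

    /* Hypothetical helpers standing in for the frequent and the rare work. */
    static int needs_rare_path(int x) { return x < 0; }   /* rarely true     */
    static int frequent_work(int x)   { return 3 * x; }   /* common path     */
    static int rare_work(int x)       { return -x;    }   /* infrequent path */

    /* Split loop: the first loop contains only the frequently executed path,
     * so it can be modulo-scheduled with a lower II.  As soon as an iteration
     * needs the rare path, control transfers to a second loop that contains
     * both paths (if-converted) for the remaining iterations. */
    void process_split(const int *in, int *out, size_t n)
    {
        size_t i = 0;

        /* Fast loop: frequent path only. */
        for (; i < n; i++) {
            if (needs_rare_path(in[i]))
                break;                       /* leave for the slow loop */
            out[i] = frequent_work(in[i]);
        }

        /* Slow loop: frequent and infrequent paths for the rest. */
        for (; i < n; i++) {
            int p  = needs_rare_path(in[i]);
            out[i] = p ? rare_work(in[i]) : frequent_work(in[i]);
        }
    }

Note that the fast loop still contains the check itself; what it no longer contains are the operations of the rare path, so those no longer contribute to its ResMII or RecMII.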
Alternatively, for some loops it is beneficial to have a so-called inspector loop with a very small II that performs only the checks for all iterations. If none of the checks are positive, a second, so-called executor loop is executed that includes all the computations except the checks and the infrequently executed paths. If some checks were positive, the original loop is executed instead.
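A sketch of the inspector/executor variant, using the same hypothetical helpers as above, could look as follows.

    #include <stddef.h>

    /* Hypothetical helpers, as in the previous sketch. */
    static int needs_rare_path(int x) { return x < 0; }
    static int frequent_work(int x)   { return 3 * x; }
    static int rare_work(int x)       { return -x;    }

    /* Inspector/executor scheme: a cheap inspector loop with a very small II
     * evaluates only the checks for all iterations.  If no check fires, an
     * executor loop runs the computation without checks or rare paths;
     * otherwise the original loop (checks plus both paths) is executed. */
    void process_inspector_executor(const int *in, int *out, size_t n)
    {
        int any_check_positive = 0;

        /* Inspector: checks only. */
        for (size_t i = 0; i < n; i++)
            any_check_positive |= needs_rare_path(in[i]);

        if (!any_check_positive) {
            /* Executor: no checks, no rare path. */
            for (size_t i = 0; i < n; i++)
                out[i] = frequent_work(in[i]);
        } else {
            /* Original loop: checks and both code paths. */
            for (size_t i = 0; i < n; i++)
                out[i] = needs_rare_path(in[i]) ? rare_work(in[i])
                                                : frequent_work(in[i]);
        }
    }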
One caveat with this loop splitting is that it expands the code size in the CGRA instruction memories. For power consumption reasons, these memories are kept as small as possible, so the local improvements obtained with the loop splitting need to be balanced against the total code size of all loops that have to share these memories.
Kernel-Only Loops
Predication can also be used to generate so-called kernel-only loop code. This is loop code that has no separate prologue and epilogue code fragments. Instead, the prologue and epilogue are included in the kernel itself, where predication is now used to guard whole software pipeline stages and to ensure that only the appropriate software pipeline stages are activated at each point in time. A traditional loop with a separate prologue and epilogue is compared to a kernel-only loop in Fig. 8, and a simplified sketch of such stage predication is shown below. Three observations need to be made here.
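As a purely conceptual illustration, the following C sketch models a kernel-only loop for an assumed three-stage software pipeline; the stage bodies, buffers, and predicates are illustrative assumptions and do not correspond to actual ADRES code.

    #include <stddef.h>

    #define STAGES 3   /* hypothetical pipeline depth */

    /* Kernel-only loop for an assumed three-stage software pipeline
     * (S0 = load, S1 = compute, S2 = store).  Every iteration of the single
     * kernel executes all three stages, but each stage is guarded by a
     * predicate that is only true while that stage should be live, so no
     * separate prologue or epilogue code is needed. */
    void kernel_only(const int *in, int *out, size_t n)
    {
        int s0 = 0, s1 = 0;                        /* inter-stage buffers     */

        for (size_t t = 0; t < n + STAGES - 1; t++) {
            int p0 = (t < n);                      /* S0 live while filling   */
            int p1 = (t >= 1) && (t < n + 1);      /* S1 live in the middle   */
            int p2 = (t >= 2);                     /* S2 live while draining  */

            /* Stages appear in reverse order so each reads last cycle's data. */
            if (p2) out[t - 2] = s1 + 1;           /* S2: store               */
            if (p1) s1 = s0 * 2;                   /* S1: compute             */
            if (p0) s0 = in[t];                    /* S0: load                */
        }
    }

In kernel-only code on a predicated CGRA, the guards p0, p1, and p2 become stage predicates rather than branches, so the pipeline ramps up and drains inside the single kernel loop.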
The first observation is that kernel-only code is usually faster because the pipeline stages of the prologue and epilogue are now executed on the CGRA accelerator, which can typically do so at much higher IPCs than the main core. This is a major difference between (ADRES) CGRAs and VLIWs: on the latter, kernel-only loops are much less useful because all code runs on the same number of ISs anyway.
The second observation is that, while kernel-only code will be faster on CGRAs, more time is spent in CGRA mode, as can be seen in Fig. 8. During the epilogue and prologue, the