Digital Signal Processing Reference
and operation scheduling. Similar techniques for wide loads and stores have also
been proposed for regular VLIW architectures to reduce power [60]. Exploiting
that hardware likewise requires manual data layout optimizations.
Both Silicon Hive and PACT feature distributed memory blocks without a
crossbar. A Silicon Hive programmer has to specify the allocation of data to
the memories so that the compiler can bind the appropriate load/store operations
to the corresponding memories. Silicon Hive also supports interfacing the memory
or system bus through FIFO interfaces. This is efficient for streaming
processing, but it is difficult to use when the data needs to be buffered, as in
the case of data reuse.
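To see why explicit data allocation matters on such architectures, consider a minimal sketch of distributed banks without a crossbar. The topology below (each load/store issue slot hard-wired to exactly one bank) and all names are illustrative assumptions, not Silicon Hive's actual interface:

```python
# Hypothetical model: without a crossbar, each load/store issue slot (IS)
# reaches only the one bank it is wired to, so every array must be
# allocated to the bank of the IS that accesses it.

# Assumed, illustrative topology: IS i is connected only to bank i.
IS_TO_BANK = {0: 0, 1: 1, 2: 2, 3: 3}

def check_allocation(allocation, accesses):
    """allocation: array name -> bank; accesses: (is_id, array) pairs.
    Returns the accesses that are infeasible without a crossbar."""
    return [(is_id, arr) for is_id, arr in accesses
            if allocation[arr] != IS_TO_BANK[is_id]]

alloc = {"coeffs": 0, "samples": 1}
print(check_allocation(alloc, [(0, "coeffs"), (1, "samples")]))  # feasible: []
print(check_allocation(alloc, [(2, "coeffs")]))  # infeasible access reported
```

In this model, moving an access from one IS to another (as an instruction scheduler might) forces a matching change in the data layout, which is exactly the coupling the programmer has to manage by hand.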
The ADRES architecture template provides a parameterizable Data Memory
Queue (DMQ) interface to each of the different single-ported, interleaved level-1
scratch-pad memory banks [19]. The DMQ interface is responsible for resolving
bank access conflicts, i.e., cases in which multiple load/store ISs try to access the
same bank in the same cycle. Connecting all load/store ISs to all banks through a
conflict resolution mechanism allows maximal freedom for data access patterns and
also maximal freedom on the data layout in memory. The potential disadvantage of
such conflict resolution is that it increases the latency of load operations. In software
pipelined code, however, increasing the individual latency of instructions most often
does not have a negative effect on the schedule quality, because the compiler can
hide those latencies in the software pipeline. In the main processor VLIW mode of
an ADRES, the same memories are accessed in code that is not software-pipelined.
So in that mode, the conflict resolution is disabled to obtain shorter access latencies.
Alternatively, a data cache can be added to the memory hierarchy to complement
the scratch-pad memories. By letting the compiler partition the data over the scratch-
pad memories and the data cache in an appropriate manner, high throughput can be
obtained in the CGRA mode, as well as low latency in the VLIW mode [32].
3.6 Compiler Support
Apart from the specific algorithms used to compile code, the major distinctions
between the different existing compiler techniques relate to whether or not they
support static scheduling, whether or not they support dynamic reconfiguration,
whether or not they rely on special programming languages, and whether or not
they are limited to specific hardware properties. Because most compiler research has
been done to generate static schedules for CGRAs, we focus on those in this section.
As already indicated in Sects. 3.2.1 and 3.2.2, many algorithms are based on FPGA
placement and routing techniques [7] in combination with VLIW code generation
techniques like modulo scheduling [39, 61] and hyperblock formation [45].
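The starting point of modulo scheduling is the minimum initiation interval (MII), the smallest number of cycles between successive loop iterations, computed as the maximum of a resource-constrained and a recurrence-constrained bound. The sketch below uses the standard textbook formulation; the operation counts and dependence cycles are made-up example inputs:

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource-constrained bound: a resource class r with unit_counts[r]
    units can start at most that many of its op_counts[r] ops per cycle."""
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(cycles):
    """Recurrence-constrained bound: cycles is a list of
    (total_latency, loop_carried_distance) per dependence cycle."""
    return max(ceil(lat / dist) for lat, dist in cycles)

# Example loop: 4 memory ops on 2 load/store units, 6 ALU ops on 4 ALUs,
# and two dependence cycles (latency 3 over distance 1, latency 4 over 2).
mii = max(res_mii({"mem": 4, "alu": 6}, {"mem": 2, "alu": 4}),
          rec_mii([(3, 1), (4, 2)]))
print(mii)  # -> 3: the recurrence bound dominates here
```

The scheduler then tries to find a valid schedule at II = MII and increments the II on failure, which is why hiding individual operation latencies inside the pipeline, as discussed above, usually costs nothing in throughput.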
Whether or not compiler techniques rely on specific hardware properties is
not always obvious in the literature, as not enough details are available in the
descriptions of the techniques, and few techniques have been tried on a wide range