Digital Signal Processing Reference
We can conclude by noting that, as in other architecture paradigms such as
VLIW processing or superscalar out-of-order execution, dynamically scheduled
CGRAs can deliver higher performance than statically scheduled ones for control-
intensive code with unpredictable behavior. On dynamically scheduled CGRAs the
code path that gets executed in an iteration determines the execution time of that
iteration, whereas on statically scheduled CGRAs, the combination of all possible
execution paths (including the slowest path which might be executed infrequently)
determines the execution time. Thus, dynamically scheduled CGRAs can provide
higher performance for such applications, but their power efficiency is then
typically lower because more power is consumed in the control path. Again, the
application domain determines which design option is the most appropriate.
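The latency argument above can be illustrated with a toy model. The path names, cycle counts, and path frequencies below are illustrative assumptions, not figures from any real CGRA; the point is only that a static schedule pays the worst-case path latency on every iteration, while a dynamic schedule pays only for the path actually taken.

```python
# Toy latency model contrasting statically and dynamically scheduled
# execution of a loop with two control paths. All numbers are hypothetical.

def static_iteration_latency(path_latencies):
    # A static schedule must accommodate every possible path, so each
    # iteration takes as long as the slowest path.
    return max(path_latencies.values())

def dynamic_iteration_latency(path_latencies, taken_path):
    # A dynamic schedule only pays for the path actually executed.
    return path_latencies[taken_path]

paths = {"fast": 4, "slow": 20}          # cycles per path (hypothetical)
trace = ["fast"] * 95 + ["slow"] * 5     # the slow path executes rarely

static_total = static_iteration_latency(paths) * len(trace)
dynamic_total = sum(dynamic_iteration_latency(paths, p) for p in trace)

print(static_total)   # 2000 cycles: every iteration pays the slow-path cost
print(dynamic_total)  # 480 cycles: only 5 iterations pay the slow-path cost
```

With an infrequent slow path, the dynamically scheduled total is far lower; if every iteration took the slow path, the two totals would coincide, which is why the benefit depends on the application's control behavior.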
3.2.3 Thread-Level and Data-Level Parallelism
Another important aspect of control is the possibility to support different forms
of parallelism. Obviously, loosely-coupled CGRAs can operate in parallel with the
main CPU, but one can also try to use the CGRA resources to implement SIMD or
to run multiple threads concurrently within the CGRA.
When dynamic scheduling is implemented via distributed event-based control, as
in KressArray or PACT, implementing TLP is relatively simple and cheap. For loops
small enough that their combined resource use fits on the CGRA, it suffices to
map independent thread controllers onto different parts of the distributed control.
For architectures with centralized control, the only option to run threads in
parallel is to provide additional controllers or to extend the central controller, for
example to support parallel execution modes. While such extensions will increase
the power consumption of the controller, the newly supported modes might suit
certain code fragments better, thus saving data path energy and configuration
fetch energy.
The TRIPS controller supports four operation modes [63]. In the first mode, all
ISs cooperate to execute one thread. In the second mode, the four rows execute
four independent threads. In the third mode, fine-grained multi-threading [66] is
supported by time-multiplexing all ISs over multiple threads. Finally, in the fourth
mode each row executes the same operation on each of its ISs, thus implementing
SIMD in a fetch-power-efficient manner similar to the two modes of the
MorphoSys design. Thus, for each loop or combination of loops in an application,
the TRIPS compiler can exploit the best-suited form of parallelism.
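A minimal sketch of the four modes, assuming a 4x4 grid of ISs (the geometry and mode names are illustrative assumptions, not the TRIPS ISA's terminology), shows how the same array is partitioned differently in each case:

```python
# Hypothetical sketch: how a 4x4 grid of instruction slots (ISs) could be
# partitioned under the four operation modes described above. The grid size
# and the mode labels are illustrative assumptions.

GRID_ROWS, GRID_COLS = 4, 4

def assign_threads(mode):
    """Map each (row, col) IS coordinate to the work it serves."""
    coords = [(r, c) for r in range(GRID_ROWS) for c in range(GRID_COLS)]
    if mode == "single-thread":       # all ISs cooperate on one thread
        return {rc: 0 for rc in coords}
    if mode == "row-threads":         # each row runs an independent thread
        return {(r, c): r for (r, c) in coords}
    if mode == "fine-grained":        # every IS is time-multiplexed over threads
        return {rc: "time-multiplexed" for rc in coords}
    if mode == "simd":                # each row issues one operation to all its ISs
        return {(r, c): ("row-op", r) for (r, c) in coords}
    raise ValueError(f"unknown mode: {mode}")

# In row-threads mode, the 16 ISs split into 4 groups of 4, one per thread.
print(len(set(assign_threads("row-threads").values())))  # 4 distinct threads
```

The point of the sketch is that the hardware population is fixed; only the controller's mapping of ISs to threads or operations changes between modes.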
The Raw architecture [69] is a hybrid between a many-core architecture and
a CGRA in the sense that it does not feature a 2D array of ISs, but
rather a 2D array of tiles that each consist of a simple RISC processor. The tiles
are connected to each other via a mesh interconnect, and transporting data over
this interconnect to neighboring tiles does not consume more time than retrieving
data from the RF in the tile. Moreover, the control of the tiles is such that they can
operate independently or synchronized in a lock-step mode. Thus, multiple tiles can