Digital Signal Processing Reference
We can conclude by noting that, as in other architecture paradigms such as
VLIW processing or superscalar out-of-order execution, dynamically scheduled
CGRAs can deliver higher performance than statically scheduled ones for control-
intensive code with unpredictable behavior. On dynamically scheduled CGRAs the
code path that gets executed in an iteration determines the execution time of that
iteration, whereas on statically scheduled CGRAs, the combination of all possible
execution paths (including the slowest path which might be executed infrequently)
determines the execution time. Thus, dynamically scheduled CGRAs can provide
higher performance for such applications, but their power efficiency is then
typically lower because more power is consumed in the control path. Again, the
application domain determines which design option is the most appropriate.
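The latency argument above can be illustrated with a toy model. The path names, cycle counts, and path frequencies below are illustrative assumptions, not figures from any real CGRA; the point is only that a static schedule pays the worst-case path latency on every iteration, while a dynamic schedule pays only for the path actually taken.

```python
# Toy latency model contrasting statically and dynamically scheduled
# execution of a loop with two control paths. All numbers are hypothetical.

def static_iteration_latency(path_latencies):
    # A static schedule must accommodate every possible path, so each
    # iteration takes as long as the slowest path.
    return max(path_latencies.values())

def dynamic_iteration_latency(path_latencies, taken_path):
    # A dynamic schedule only pays for the path actually executed.
    return path_latencies[taken_path]

paths = {"fast": 4, "slow": 20}          # cycles per path (hypothetical)
trace = ["fast"] * 95 + ["slow"] * 5     # the slow path executes rarely

static_total = static_iteration_latency(paths) * len(trace)
dynamic_total = sum(dynamic_iteration_latency(paths, p) for p in trace)

print(static_total)   # 2000 cycles: every iteration pays the slow-path cost
print(dynamic_total)  # 480 cycles: only 5 iterations pay the slow-path cost
```

With an infrequent slow path, the dynamically scheduled total is far lower; if every iteration took the slow path, the two totals would coincide, which is why the benefit depends on the application's control behavior.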
3.2.3 Thread-Level and Data-Level Parallelism
Another important aspect of control is the possibility to support different forms
of parallelism. Obviously, loosely-coupled CGRAs can operate in parallel with the
main CPU, but one can also try to use the CGRA resources to implement SIMD or
to run multiple threads concurrently within the CGRA.
When dynamic scheduling is implemented via distributed event-based control, as
in KressArray or PACT, implementing TLP is relatively simple and cheap. For loops
small enough that their combined resource use fits on the CGRA, it suffices to
map independent thread controllers onto different parts of the distributed control.
For architectures with centralized control, the only option to run threads in
parallel is to provide additional controllers or to extend the central controller, for
example to support parallel execution modes. While such extensions will increase
the power consumption of the controller, the newly supported modes might suit
certain code fragments better, thus saving data path energy and configuration
fetch energy.
The TRIPS controller supports four operation modes [63]. In the first mode, all
ISs cooperate to execute one thread. In the second mode, the four rows execute
four independent threads. In the third mode, fine-grained multi-threading [66] is
supported by time-multiplexing all ISs over multiple threads. Finally, in the fourth
mode each row executes the same operation on each of its ISs, thus implementing
SIMD in a fetch-power-efficient manner similar to the two modes of the
MorphoSys design. Thus, for each loop or combination of loops in an application,
the TRIPS compiler can exploit the best-suited form of parallelism.
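A minimal sketch of the four modes, assuming a 4x4 grid of ISs (the geometry and mode names are illustrative assumptions, not the TRIPS ISA's terminology), shows how the same array is partitioned differently in each case:

```python
# Hypothetical sketch: how a 4x4 grid of instruction slots (ISs) could be
# partitioned under the four operation modes described above. The grid size
# and the mode labels are illustrative assumptions.

GRID_ROWS, GRID_COLS = 4, 4

def assign_threads(mode):
    """Map each (row, col) IS coordinate to the work it serves."""
    coords = [(r, c) for r in range(GRID_ROWS) for c in range(GRID_COLS)]
    if mode == "single-thread":       # all ISs cooperate on one thread
        return {rc: 0 for rc in coords}
    if mode == "row-threads":         # each row runs an independent thread
        return {(r, c): r for (r, c) in coords}
    if mode == "fine-grained":        # every IS is time-multiplexed over threads
        return {rc: "time-multiplexed" for rc in coords}
    if mode == "simd":                # each row issues one operation to all its ISs
        return {(r, c): ("row-op", r) for (r, c) in coords}
    raise ValueError(f"unknown mode: {mode}")

# In row-threads mode, the 16 ISs split into 4 groups of 4, one per thread.
print(len(set(assign_threads("row-threads").values())))  # 4 distinct threads
```

The point of the sketch is that the hardware population is fixed; only the controller's mapping of ISs to threads or operations changes between modes.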
The Raw architecture [69] is a hybrid between a many-core architecture and
a CGRA in the sense that it does not feature a 2D array of ISs, but
rather a 2D array of tiles that each consist of a simple RISC processor. The tiles
are connected to each other via a mesh interconnect, and transporting data over
this interconnect to neighboring tiles does not consume more time than retrieving
data from the RF in the tile. Moreover, the control of the tiles is such that they can
operate independently or synchronized in a lock-step mode. Thus, multiple tiles can