IC_p of each phase, and in terms of the time overhead involved in switching between
the phases t_{p→p+1} as follows:

    performance = 1 / execution time
                = 1 / Σ_{p∈P} ( IC_p / (IPC_p · f_p) + t_{p→p+1} ).    (1)
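As a sketch of Eq. (1), the following C function evaluates this performance model over a list of phases. The struct, its field names, and the sample values in the usage below are illustrative assumptions, not parameters of any real processor.

```c
/* Hypothetical per-phase parameters; field names follow Eq. (1). */
struct phase {
    double ic;    /* instruction count IC_p               */
    double ipc;   /* instructions per cycle IPC_p         */
    double f;     /* operating frequency f_p, in Hz       */
    double t_sw;  /* switching overhead t_{p->p+1}, in s  */
};

/* Accumulate IC_p / (IPC_p * f_p) + t_{p->p+1} over all phases
 * p in P, then invert the total execution time to obtain the
 * performance of Eq. (1). */
double performance(const struct phase *phases, int n)
{
    double exec_time = 0.0;
    for (int p = 0; p < n; p++)
        exec_time += phases[p].ic / (phases[p].ipc * phases[p].f)
                   + phases[p].t_sw;
    return 1.0 / exec_time;
}
```

For example, two phases of one million instructions each, executed at 2 IPC and 1 GHz with no switching overhead, yield an execution time of 1 ms and thus a performance of 1000 executions per second.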
The operating frequencies f_p cannot be increased indefinitely because of power-
efficiency reasons. Alternatively, a designer can increase the performance by
designing or selecting a system that can execute code at higher IPCs. In a power-
efficient architecture, a high IPC is reached for the most important phases l ∈ L ⊆ P,
with L typically consisting of the compute-intensive inner loops, while limiting
their instruction count IC_l and reaching a sufficiently high, but still power-efficient,
frequency f_l. Furthermore, the time overhead t_{p→p+1} as well as the corresponding
energy overhead of switching between the execution modes of consecutive phases
should be minimized if such switching happens frequently. Note that such switching
only happens on hardware that supports multiple execution modes in support of
phases with different characteristics.
Coarse-Grained Reconfigurable Array (CGRA) accelerators aim for these goals
for the inner loops found in many digital signal processing (DSP) domains, includ-
ing multimedia and Software-Defined Radio (SDR) applications. Such applications
have traditionally employed Very Long Instruction Word (VLIW) architectures such
as the TriMedia 3270 [74] and the TI C64 [70], Application-Specific Integrated
Circuits (ASICs), and Application-Specific Instruction Processors (ASIPs). To a
large degree, the reasons for running these applications on VLIW processors also
apply for CGRAs. First of all, a large fraction of the computation time is spent
in manifest nested loops that perform computations on arrays of data and that
can, possibly through compiler transformations, provide a lot of Instruction-Level
Parallelism (ILP). Secondly, most of those inner loops are relatively simple. When
the loops include conditional statements, these can be implemented by means of
predication [45] instead of with complex control flow. Furthermore, none or very
few loops contain multiple exits or continuation points in the form of, e.g., break
or continue statements as in the C language. Moreover, after inlining, the loops
are free of function calls. Finally, the loops are not regular or homogeneous enough
to benefit from vector computing, like on the EVP [6] or on Ardbeg [75]. When
there is enough regularity and Data-Level Parallelism (DLP) in the loops of an
application, vector computing can typically exploit it more efficiently than what
can be achieved by converting the DLP into ILP and exploiting that on a CGRA.
So, in short, CGRAs (with limited DLP support) are ideally suited for applications
whose time-consuming parts have manifest behavior, large amounts of ILP, and
limited amounts of DLP.
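The predication (if-conversion) mentioned above can be illustrated with a minimal C sketch. The function and data are hypothetical, and in practice a compiler applies this transformation to predicated hardware operations rather than at the source level.

```c
/* Branchy loop body:   if (a[i] > 0) s += a[i]; else s -= a[i];
 *
 * If-converted (predicated) form: both arms are computed and a
 * predicate selects between them, so the loop body contains no
 * control flow besides the loop branch itself. */
int abs_sum_predicated(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) {
        int p   = a[i] > 0;     /* predicate               */
        int add = s + a[i];     /* result of the then-arm  */
        int sub = s - a[i];     /* result of the else-arm  */
        s = p ? add : sub;      /* predicated select       */
    }
    return s;
}
```

Removing the conditional branch this way lets all iterations issue the same fixed sequence of operations, which is what allows a CGRA (or VLIW) to software-pipeline the loop at a high IPC.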
In the remainder of this chapter, Sect. 2 presents the fundamental properties
of CGRAs. Section 3 gives an overview of the design options for CGRAs. This
overview helps designers evaluate whether or not CGRAs are suited for their
applications and their design requirements, and if so, which CGRA designs are
most suited. After the overview, Sect. 4 presents a case study on the ADRES CGRA