Table 3.15 Comparison of PE configurations

Item                        Coarse grained with multiplier   Fine grained in general   Fine grained in this work
                            (8 bits-32 bits)                 (1 bit or 2 bits)         (2 bits)
Parallelism per unit area   Moderate                         Very good                 Very good
Performance of addition     Moderate                         Very good                 Very good
Performance of MAC          Moderate                         Not good                  Good

metal layers, and the V-ch circuit shown in Fig. 3.66 is simple and small enough;
therefore, these powerful networks have been realized with negligibly small silicon
area overhead.
3.3.1.2 PE Design
Several PE configurations are candidates for building a massively parallel SIMD
processor such as MX-1. Table 3.15 compares them. In general, a finer-grained PE
configuration has an advantage in area efficiency, because the circuit of each PE is
simple and small. MX-1 exploits this feature and achieves a parallelism of 2,048 in
a small silicon area of 3.1 mm² in 90-nm process technology. Conventional
coarse-grained configurations [57-59], on the other hand, require a large silicon
area, so the realized parallelism is moderate, for example, up to 128. Because a
coarse-grained PE is usually equipped with a dedicated multiplier, it processes both
simple additions and MAC operations with moderate performance. Each PE of
MX-1 is basically composed of 2-bit-grained full adders. MX-1 therefore performs
best in applications that consist mainly of simple additions or subtractions, for
example, pixel interpolation and SAD (sum of absolute differences). MAC operations,
in contrast, cost many clock cycles because they are broken down into simple
additions. Our motivation is to enhance MAC performance while keeping the
fine-grained (2-bit) PE configuration and without reducing the massive parallelism
of 2,048. However, it is difficult to equip the 2-bit-grained PEs employed in MX-1
with dedicated multipliers. Therefore, some contrivances are required both in the
PE circuit configuration and in the operation flow.
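To see why a MAC is costly on an adder-only PE, the breakdown into additions can be sketched as follows. This is a minimal illustration of plain shift-and-add multiplication, not MX-1's actual operation flow; the function name `shift_add_multiply` and the 8-bit default width are assumptions made for the example. In a SIMD array every PE executes the same step, so one partial-product step is counted per multiplier bit even when that bit is zero.

```python
from typing import Tuple


def shift_add_multiply(multiplicand: int, multiplier: int, bits: int = 8) -> Tuple[int, int]:
    """Multiply using only shifts and conditional additions.

    `multiplier` is treated as an unsigned `bits`-bit value (an assumption
    of this sketch). Returns (product, number of partial-product steps).
    """
    if not 0 <= multiplier < (1 << bits):
        raise ValueError("multiplier does not fit in the given width")

    product = 0
    steps = 0
    for i in range(bits):
        # One partial product is generated per multiplier bit.
        steps += 1
        if (multiplier >> i) & 1:
            product += multiplicand << i
    return product, steps
```

Under these assumptions, `shift_add_multiply(13, 11)` returns `(143, 8)`: an 8-bit multiplier needs eight partial-product steps, each of which is itself a multi-cycle addition on a 2-bit-grained adder.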
Because a MAC operation is broken down into simple additions, the best way to
speed it up is to reduce the total number of additions by generating fewer partial
products in the MAC operation flow. Booth's algorithm is a well-known method for
enhancing MAC performance by decreasing the number of partial products generated.
The radix-4 Booth encoding table shown in Table 3.16 reveals three characteristic
operations that are applied to the multiplicand: one-bit shifting, complementing,
and NOP (no operation). Therefore, if we add some control circuits to support these
operations, we can apply Booth's algorithm to our 2-bit-grained processor elements.
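The recoding idea can be sketched as follows. This is the textbook radix-4 Booth recoding, assumed here to correspond to Table 3.16; it models the arithmetic only, not the PE circuit or its control signals. The function names and the 8-bit default width are hypothetical. Each recoded digit lies in {-2, -1, 0, +1, +2} and maps onto the three operations named above: magnitude 2 is a one-bit shift of the multiplicand, a negative sign is a complement, and zero is a NOP.

```python
from typing import List


def booth_radix4_digits(multiplier: int, bits: int) -> List[int]:
    """Recode a signed two's-complement multiplier into radix-4 Booth digits.

    Digits are returned least significant first; digit i has weight 4**i.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even for radix-4 recoding")
    if not -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1)):
        raise ValueError("multiplier does not fit in the given width")

    pattern = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, bits: int = 8) -> int:
    """Multiply by adding one (possibly modified) multiplicand per Booth digit."""
    product = 0
    for i, digit in enumerate(booth_radix4_digits(multiplier, bits)):
        if digit == 0:
            continue  # NOP: no partial product is added
        partial = multiplicand
        if abs(digit) == 2:
            partial <<= 1  # one-bit shift (multiplicand x 2)
        if digit < 0:
            partial = -partial  # complement (two's-complement negation)
        product += partial << (2 * i)
    return product
```

For an 8-bit multiplier the recoding yields four digits, so at most four partial products are added instead of the eight needed when the multiplier is scanned one bit at a time. For example, `booth_radix4_digits(11, 8)` gives `[-1, -1, 1, 0]`, that is, 11 = -1 - 4 + 16.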
Figure 3.67 shows a circuit diagram of the PE adopted in this work. Its
configuration is quite simple: it is mainly composed of two full adders (FAs), eight
flip-flops, and some logic. This circuit is designed to support the radix-4 Booth's algorithm
 