Processor Cores - Heterogeneous Multicore Processor Technologies for Embedded Systems

Table 3.15 Comparison of PE configurations

                              Configuration of PE
Item                          Coarse grained         Fine grained           Fine grained
                              with multiplier        in general             in this work
                              (8 bits-32 bits)       (1 bit or 2 bits)      (2 bits)
Parallelism per unit area     Moderate               Very good              Very good
Performance of addition       Moderate               Very good              Very good
Performance of MAC            Moderate               Not good               Good

metal layers, and the V-ch circuit shown in Fig. 3.66 is simple and small enough; therefore, these powerful networks have been realized with negligibly small silicon area overhead.

3.3.1.2 PE Design

Several kinds of PE configurations are candidates for building a massively parallel SIMD processor like MX-1. Table 3.15 compares them. In general, a finer-grained PE configuration has an advantage in area efficiency, because the circuit structure of each PE is simple and small. MX-1 exploits this feature and maximizes the parallelism up to 2,048 in a small silicon area of 3.1 mm² in 90-nm process technology. On the other hand, conventional coarse-grained configurations [57-59] require a large silicon area, so the realized parallelism is moderate, for example, up to 128. Because a coarse-grained PE is usually equipped with a dedicated multiplier, both simple additions and MAC operations are processed with moderate performance.

Each PE of MX-1 is basically composed of 2-bit-grained full adders. Therefore, MX-1 gives the best performance in applications that consist mainly of simple additions or subtractions, for example, pixel interpolation and SAD (sum of absolute differences). In contrast, MAC operations cost many clock cycles because they are realized by breaking them down into simple additions. Our motivation is to enhance MAC performance while keeping the fine-grained (2-bit) PE configuration and without reducing the massive parallelism of 2,048. However, it is difficult to equip the 2-bit-grained PEs employed in MX-1 with dedicated multipliers. Therefore, some ingenuity is required in both the PE circuit configuration and the operation flow.

Because a MAC operation is realized by breaking it down into simple additions, the best way to speed it up is considered to be reducing the total number of additions by decreasing the number of partial products generated in the MAC operation flow. Booth's algorithm is a well-known method for enhancing MAC performance by decreasing the number of partial products generated. The radix-4 Booth encoding table shown in Table 3.16 reveals three characteristic operations that are applied to the multiplicand: one-bit shifting, complementing, and NOP (no operation). Therefore, if we add some control circuits to support these operations, we can apply Booth's algorithm to our 2-bit-grained processor elements.
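The role of the three operations can be illustrated in software. The following Python sketch is not the MX-1 circuit or its operation flow; it is a generic radix-4 Booth recoding, and the function names `booth_radix4_digits` and `booth_multiply` are chosen here for illustration only. Table 3.16 is not reproduced in this excerpt, so the sketch assumes the standard textbook recoding, in which each overlapping 3-bit group of the multiplier maps to a digit in {-2, -1, 0, +1, +2}.

```python
def booth_radix4_digits(multiplier: int, bits: int) -> list[int]:
    """Recode a signed `bits`-wide multiplier into radix-4 Booth digits.

    Digits are returned least significant first; each is in {-2,-1,0,1,2}.
    Assumes `multiplier` fits in a signed two's-complement word of `bits` bits.
    """
    if bits % 2:
        bits += 1  # sign-extend to an even width
    digits = []
    prev = 0  # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        b0 = (multiplier >> i) & 1        # y[2i]
        b1 = (multiplier >> (i + 1)) & 1  # y[2i+1]
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, bits: int) -> int:
    """Multiply using one partial product per radix-4 Booth digit."""
    acc = 0
    for i, d in enumerate(booth_radix4_digits(multiplier, bits)):
        if d == 0:
            continue                      # NOP: nothing to add
        term = multiplicand
        if abs(d) == 2:
            term <<= 1                    # one-bit shift of the multiplicand
        if d < 0:
            term = -term                  # complement of the multiplicand
        acc += term << (2 * i)            # digit i has weight 4**i
    return acc


if __name__ == "__main__":
    print(booth_radix4_digits(7, 4))      # [-1, 2], since 7 = -1 + 2*4
    print(booth_multiply(13, -11, 8))     # -143
```

For a 16-bit multiplier the recoding yields 8 digits, so at most 8 partial products are added, compared with up to 16 in a plain bit-by-bit shift-and-add multiplication. Each partial product needs only a one-bit shift, a complement, or nothing at all, which matches the three operations listed above.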

Figure 3.67 shows a circuit diagram of the PE adopted in this work. Its configuration is quite simple: it is mainly composed of two full adders (FAs), eight flip-flops, and some logic. This circuit is designed to support the radix-4 Booth's algorithm
