Processor Cores - Heterogeneous Multicore Processor Technologies for Embedded Systems


Table 3.4  Early-stage branch instructions

Instruction   Code       Displacement   Function
BT Label      10001001   8 bits         If (T==1) PC = PC + 4 + disp*2
BF Label      10001011   8 bits         If (T==0) PC = PC + 4 + disp*2
BT/S Label    10001101   8 bits         If (T==1) PC = PC + 4 + disp*2; execute delay slot
BF/S Label    10001111   8 bits         If (T==0) PC = PC + 4 + disp*2; execute delay slot
BRA Label     1010       12 bits        PC = PC + 4 + disp*2; execute delay slot
BSR Label     1011       12 bits        PR = PC + 4; PC = PC + 4 + disp*2; execute delay slot
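
The Function column of Table 3.4 can be read as a small behavioral model. The following Python sketch is an illustration only, not the SH-4 implementation: the names `sign_extend` and `early_branch` are hypothetical, the displacement is assumed to be a signed (two's-complement) field in the low bits of a 16-bit instruction code, and the behavior of an untaken conditional branch is simply reported as "no target".

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


# Upper 8 bits of the code -> (mnemonic, required T value, has delay slot)
CONDITIONAL = {
    0b10001001: ("BT", 1, False),
    0b10001011: ("BF", 0, False),
    0b10001101: ("BT/S", 1, True),
    0b10001111: ("BF/S", 0, True),
}
# Upper 4 bits of the code -> mnemonic (both rows have a delay slot)
UNCONDITIONAL = {0b1010: "BRA", 0b1011: "BSR"}


def early_branch(code: int, pc: int, t: int):
    """Model one row of Table 3.4.

    Returns (mnemonic, target, pr, delay_slot). `target` is None when a
    conditional branch is not taken, and `pr` is set only for BSR.
    Returns None when `code` matches no row of the table.
    """
    top8, top4 = code >> 8, code >> 12
    if top8 in CONDITIONAL:
        name, required_t, delay_slot = CONDITIONAL[top8]
        if t != required_t:
            return name, None, None, False  # condition false: no branch action
        return name, pc + 4 + sign_extend(code, 8) * 2, None, delay_slot
    if top4 in UNCONDITIONAL:
        name = UNCONDITIONAL[top4]
        pr = pc + 4 if name == "BSR" else None
        return name, pc + 4 + sign_extend(code, 12) * 2, pr, True
    return None
```

For instance, `early_branch(0x8902, 0x1000, 1)` models a taken BT with disp = 2, so the target is 0x1000 + 4 + 2*2.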

The key to the early-stage branch was calculating the branch address at the ID stage, which was realized by performing the calculation in parallel with instruction decoding. The early-stage branch was applied to the six frequently used branch instructions summarized in Table 3.4. The calculation was an 8-bit or 12-bit offset addition, and a 1-bit check of the instruction code could identify the offset size of the six branch instructions. Of the two instruction codes at the ID stage, the first was chosen for processing if it was a branch; otherwise, the second was chosen. However, this judgment took more time than the 1-bit check above, so part of the calculation was done before the selection, with the required hardware duplicated to allow the parallel operation.
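
The ordering described above (quick offset-size check and address addition first, slower first-or-second selection afterwards) can be sketched in Python. This is a rough software analogy under several assumptions that the text does not state: the 1-bit check is taken to be bit 13 of a 16-bit code, which is inferred from the codes in Table 3.4 (1000... versus 101...); the displacement is assumed to be signed; the second code is assumed to sit 2 bytes after the first; and the exact split of duplicated hardware is not specified, so the sketch simply computes a candidate target for both codes. All function names are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def is_early_branch(code: int) -> bool:
    """Slow path: full match against the six codes of Table 3.4."""
    return (code >> 12) in (0b1010, 0b1011) or (code >> 8) in (
        0b10001001, 0b10001011, 0b10001101, 0b10001111)


def offset(code: int) -> int:
    """Fast path: one bit picks the 8-bit or 12-bit displacement.

    Assumption: bit 13 is 0 for the 1000xxxx codes and 1 for 1010/1011.
    The result is meaningless for non-branch codes and is discarded later.
    """
    bits = 12 if (code >> 13) & 1 else 8
    return sign_extend(code, bits) * 2


def id_stage_target(pc: int, first: int, second: int):
    """Return the branch target chosen at the ID stage, or None."""
    # Done before the selection: one candidate per instruction code,
    # standing in for the duplicated address-calculation hardware.
    candidates = (pc + 4 + offset(first), (pc + 2) + 4 + offset(second))
    # Done afterwards: the slower judgment of which code to process.
    if is_early_branch(first):
        return candidates[0]
    if is_early_branch(second):
        return candidates[1]
    return None
```

The point of the sketch is only the ordering: both candidate additions finish without waiting for `is_early_branch`, and the selection then picks one result instead of starting an addition.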

3.1.2.7 Performance Evaluations

The SH-4 performance was measured using the Dhrystone benchmark, which was popular for evaluating the integer performance of embedded processors [5]. The Dhrystone benchmark is small enough that the whole program and its data fit into the caches, and small enough to be used from the beginning of processor development. Therefore, the processor core architecture alone can be evaluated without influence from the system-level architecture, and the evaluation result can be fed back to the architecture design. On the other hand, system-level performance, which depends on cache miss rates, external memory access throughput and latencies, and so on, cannot be measured. The evaluation result includes compiler performance because the Dhrystone benchmark is written in C. The optimizing compiler tuned for the SH-4 was used to compile the benchmark.

An optimizing compiler for a superscalar processor must perform new optimizations that are not necessary for a scalar processor. For example, a load instruction and the instruction that uses the loaded data must be two or more cycles apart to avoid a pipeline stall. A scalar processor needs one instruction inserted between the two, but a superscalar processor needs two or three. Therefore, the optimizing compiler must insert more independent instructions than a compiler for a scalar processor does.
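
The "one" versus "two or three" figures can be reproduced by counting issue slots. The Python sketch below is a simplified model, not a description of the SH-4 scheduler: it assumes in-order issue, a fixed number of issue slots per cycle, and that every slot between the load and its consumer is filled with an independent instruction. The function name and parameters are hypothetical.

```python
def independent_instructions_needed(issue_width: int,
                                    load_use_distance: int,
                                    load_slot: int) -> int:
    """Count the issue slots between a load and the first stall-free consumer.

    issue_width       -- instructions issued per cycle (1 = scalar, 2 = dual issue)
    load_use_distance -- required distance in cycles between load and use
    load_slot         -- position of the load within its issue group (0-based)
    """
    if not 0 <= load_slot < issue_width:
        raise ValueError("load_slot must be within the issue group")
    # Slots left in the load's own cycle, plus the full cycles in between.
    remaining_in_cycle = issue_width - load_slot - 1
    full_cycles_between = issue_width * (load_use_distance - 1)
    return remaining_in_cycle + full_cycles_between


# Scalar pipeline, two-cycle distance: one instruction in between.
print(independent_instructions_needed(1, 2, 0))
# Dual issue, load in the first or second slot of its group.
print(independent_instructions_needed(2, 2, 0))
print(independent_instructions_needed(2, 2, 1))
```

Under these assumptions the count depends on where the load falls within its issue group, which is one way to read the "two or three" range in the text.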
