Table 3.4 Early-stage branch instructions

Instruction   Code      Displacement  Function
BT Label      10001001  8 bits        If (T==1) PC = PC + 4 + disp*2
BF Label      10001011  8 bits        If (T==0) PC = PC + 4 + disp*2
BT/S Label    10001101  8 bits        If (T==1) PC = PC + 4 + disp*2; execute delay slot
BF/S Label    10001111  8 bits        If (T==0) PC = PC + 4 + disp*2; execute delay slot
BRA Label     1010      12 bits       PC = PC + 4 + disp*2; execute delay slot
BSR Label     1011      12 bits       PR = PC + 4; PC = PC + 4 + disp*2; execute delay slot

The key technique for the early-stage branch was to calculate the branch address at
the ID stage, which was realized by performing the calculation in parallel with
instruction decoding. The early-stage branch was applied to the six frequently used
branch instructions summarized in Table 3.4. The calculation was an 8-bit or 12-bit
offset addition, and a 1-bit check of the instruction code could identify the offset
size of the six branch instructions. Of the two instruction codes at the ID stage, the
first was chosen for processing if it was a branch; otherwise, the second was chosen.
However, this judgment took more time than the 1-bit check above, so part of the
calculation was done before the selection by duplicating the required hardware to
realize the parallel operation.
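
The address calculation described above can be sketched in software as follows. This is only an illustrative model, not the SH-4 hardware logic. The function names are hypothetical, the bit position used for the 1-bit check (bit 13 of a 16-bit code) is inferred from the codes in Table 3.4, and the displacement is assumed to be a signed two's-complement value, as is usual for PC-relative branches.

```python
# Illustrative model of the early-stage branch address calculation.
# All names are hypothetical; encodings are taken from Table 3.4.

EARLY_BRANCH_8BIT = {0x89, 0x8B, 0x8D, 0x8F}  # BT, BF, BT/S, BF/S (upper 8 bits)
EARLY_BRANCH_12BIT = {0xA, 0xB}               # BRA, BSR (upper 4 bits)


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def displacement_bits(code: int) -> int:
    """1-bit check: among the six branches, bit 13 is 0 for the 1000xxxx
    codes (8-bit displacement) and 1 for 1010/1011 (12-bit displacement)."""
    return 12 if (code >> 13) & 1 else 8


def branch_target(pc: int, code: int) -> int:
    """PC = PC + 4 + disp*2, wrapped to 32 bits."""
    disp = sign_extend(code, displacement_bits(code))
    return (pc + 4 + disp * 2) & 0xFFFFFFFF


def is_early_branch(code: int) -> bool:
    """Full check that the code is one of the six early-stage branches."""
    return (((code >> 8) & 0xFF) in EARLY_BRANCH_8BIT
            or ((code >> 12) & 0xF) in EARLY_BRANCH_12BIT)


def select_code(first: int, second: int) -> int:
    """Choose the first code if it is a branch; otherwise the second."""
    return first if is_early_branch(first) else second
```

In this model `displacement_bits` needs only one bit of the code, whereas `is_early_branch` inspects up to eight bits. That loosely mirrors why the selection between the two codes was the slower step and why part of the addition was started for both candidates before the selection completed.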
3.1.2.7 Performance Evaluations
The SH-4 performance was measured using the Dhrystone benchmark, which was popular
for evaluating the integer performance of embedded processors [5]. The Dhrystone
benchmark is small enough that the whole program and its data fit into the caches,
and it can be used from the beginning of processor development. Therefore, the
processor core architecture can be evaluated in isolation, without influence from the
system-level architecture, and the evaluation result can be fed back into the
architecture design. On the other hand, system-level performance, which depends on
cache miss rates, external memory access throughput and latency, and so on, cannot be
measured. The evaluation result also includes compiler performance because the
Dhrystone benchmark is written in C. An optimizing compiler tuned for the SH-4 was
used to compile the benchmark.
An optimizing compiler for a superscalar processor needs new optimization items that
are not necessary for a scalar processor. For example, the distance between a load
instruction and an instruction using the loaded data must be two cycles or more to
avoid a pipeline stall. A scalar processor requires one instruction to be inserted
between the two instructions, but a superscalar processor requires two or three.
Therefore, the optimizing compiler must insert more independent instructions than a
compiler for a scalar processor.
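
The "one" versus "two or three" figures can be reproduced with a small counting model. The sketch below is an assumption-laden illustration, not the compiler's actual scheduling algorithm. It assumes in-order issue with a fixed number of slots per cycle, and that the consumer occupies the first slot of its cycle. The function name `min_fillers` and its parameters are hypothetical.

```python
def min_fillers(issue_width: int, load_slot: int, distance_cycles: int = 2) -> int:
    """Minimum number of independent instructions between a load and its
    consumer so that the consumer issues `distance_cycles` cycles later.

    Assumes in-order issue, `issue_width` slots per cycle, the load in slot
    `load_slot` (0-based) of its cycle, and the consumer in slot 0 of its
    cycle.
    """
    if not 0 <= load_slot < issue_width:
        raise ValueError("load_slot must be within the issue width")
    # Slots remaining after the load in the load's own cycle.
    rest_of_load_cycle = issue_width - 1 - load_slot
    # Every slot of the full cycles between the load and the consumer.
    full_cycles_between = (distance_cycles - 1) * issue_width
    return rest_of_load_cycle + full_cycles_between
```

Under these assumptions, `min_fillers(1, 0)` gives 1 for a scalar pipeline, while a two-way issue gives `min_fillers(2, 1)` = 2 or `min_fillers(2, 0)` = 3, depending on which slot the load occupies.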
 