Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach

This observation is critical because of the increased emphasis on integer performance since the explosion of the World Wide Web and cloud computing starting in the mid-1990s. Indeed, most of the market growth in the last decade (transaction processing, Web servers, and the like) depended on integer performance, rather than floating point. As we will see in the next section, for a realistic processor in 2011, the actual performance levels are much lower than those shown in Figure 3.27.

Given the difficulty of increasing the instruction rates with realistic hardware designs, designers face a challenge in deciding how best to use the limited resources available on an integrated circuit. One of the most interesting trade-offs is between simpler processors with larger caches and higher clock rates versus more emphasis on instruction-level parallelism with a slower clock and smaller caches. The following example illustrates the challenges, and in the next chapter we will see an alternative approach to exploiting fine-grained parallelism in the form of GPUs.

Example

Consider the following three hypothetical, but not atypical, processors, which we run with the SPEC gcc benchmark:

1. A simple MIPS two-issue static pipe running at a clock rate of 4 GHz and achieving a pipeline CPI of 0.8. This processor has a cache system that yields 0.005 misses per instruction.

2. A deeply pipelined version of a two-issue MIPS processor with slightly smaller caches and a 5 GHz clock rate. The pipeline CPI of the processor is 1.0, and the smaller caches yield 0.0055 misses per instruction on average.

3. A speculative superscalar with a 64-entry window. It achieves one-half of the ideal issue rate measured for this window size. (Use the data in Figure 3.27.) This processor has the smallest caches, which lead to 0.01 misses per instruction, but it hides 25% of the miss penalty on every miss by dynamic scheduling. This processor has a 2.5 GHz clock.

Assume that the main memory time (which sets the miss penalty) is 50 ns. Determine the relative performance of these three processors.

Answer

First, we use the miss penalty and miss rate information to compute the contribution to CPI from cache misses for each configuration. We do this with the following formula:

Cache CPI = Misses per instruction × Miss penalty

We need to compute the miss penalties, in clock cycles, for each system:

Miss penalty = Memory access time / Clock cycle time
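
As a rough check on the arithmetic, the cache contribution to CPI can be sketched in Python as misses per instruction times the miss penalty in clock cycles, with the penalty taken as the 50 ns memory time divided by the clock cycle time. The function and variable names here are illustrative assumptions, not from the text. The sketch also assumes that hiding 25% of the miss penalty simply scales processor 3's penalty by 0.75. Processor 3's pipeline CPI depends on the ideal issue rate in Figure 3.27, so it is left as a parameter rather than filled in.

```python
# Sketch of the cache-miss CPI calculation for the three hypothetical
# processors. All names are illustrative assumptions, not from the text.

MEMORY_TIME_NS = 50.0  # main memory time given in the example


def miss_penalty_cycles(clock_ghz, hidden_fraction=0.0):
    """Miss penalty in clock cycles.

    Cycle time in ns is 1 / clock_ghz, so memory time / cycle time equals
    memory time * clock_ghz. hidden_fraction is the share of the penalty
    assumed to be overlapped by dynamic scheduling.
    """
    return (1.0 - hidden_fraction) * MEMORY_TIME_NS * clock_ghz


def cache_cpi(misses_per_instr, clock_ghz, hidden_fraction=0.0):
    """CPI contribution from cache misses."""
    return misses_per_instr * miss_penalty_cycles(clock_ghz, hidden_fraction)


def pipeline_cpi_from_issue_rate(ideal_issue_rate, achieved_fraction=0.5):
    """Pipeline CPI for processor 3; ideal_issue_rate must be read from
    Figure 3.27 (not reproduced here)."""
    return 1.0 / (ideal_issue_rate * achieved_fraction)


def mips(clock_ghz, pipeline_cpi, cache_cpi_value):
    """Instruction execution rate in millions of instructions per second."""
    return clock_ghz * 1000.0 / (pipeline_cpi + cache_cpi_value)


# Parameters taken from the example statement.
processors = {
    1: {"clock_ghz": 4.0, "misses_per_instr": 0.005, "hidden_fraction": 0.0},
    2: {"clock_ghz": 5.0, "misses_per_instr": 0.0055, "hidden_fraction": 0.0},
    3: {"clock_ghz": 2.5, "misses_per_instr": 0.01, "hidden_fraction": 0.25},
}

for number, p in processors.items():
    penalty = miss_penalty_cycles(p["clock_ghz"], p["hidden_fraction"])
    cpi = cache_cpi(p["misses_per_instr"], p["clock_ghz"], p["hidden_fraction"])
    print(f"Processor {number}: miss penalty = {penalty:.2f} cycles, "
          f"cache CPI = {cpi:.4f}")
```

Adding each cache CPI to the corresponding pipeline CPI and passing the sum's parts to `mips` gives the instruction execution rate; for processors 1 and 2 the pipeline CPIs (0.8 and 1.0) come straight from the example statement.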
