limit performance. To understand the basis of the media reports, let's see what Amdahl [1967]
originally said:
A fairly obvious conclusion which can be drawn at this point is that the effort expended on
achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude. [p. 483]
One interpretation of the law was that, since portions of every program must be sequential,
there is a limit to the useful economic number of processors—say, 100. By showing linear speedup with 1000 processors, this interpretation of Amdahl's law was disproved.
The basis for the statement that Amdahl's law had been “overcome” was the use of scaled
speedup, also called weak scaling. The researchers scaled the benchmark to have a dataset size
that was 1000 times larger and compared the uniprocessor and parallel execution times of the
scaled benchmark. For this particular algorithm, the sequential portion of the program was
constant independent of the size of the input, and the rest was fully parallel—hence, linear
speedup with 1000 processors. Because the running time grew faster than linearly, the program
actually ran longer after scaling, even with 1000 processors.
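To make the distinction concrete, the following sketch (illustrative only, not taken from the original study) contrasts true speedup on a fixed problem size with scaled speedup under weak scaling, assuming a sequential portion that stays constant while the parallel work grows with the processor count:

# Sketch: true (fixed-size) speedup vs. scaled (weak-scaling) speedup.
# Assumed workload: sequential fraction s and parallel fraction p of the
# single-processor running time, with s + p = 1.

def true_speedup(s, p, n):
    # Amdahl's law: same problem, parallel part divided among n processors.
    return (s + p) / (s + p / n)

def scaled_speedup(s, p, n):
    # Weak scaling: parallel work grows by a factor of n, sequential part fixed.
    # Ratio of one-processor time to n-processor time for the scaled problem.
    return (s + n * p) / (s + p)

s, p = 0.01, 0.99   # assumed: 1% sequential, 99% parallel
for n in (100, 1000):
    print(n, round(true_speedup(s, p, n), 1), round(scaled_speedup(s, p, n), 1))
# True speedup saturates near 1/s = 100, while scaled speedup stays close to n,
# which is how weak scaling can appear to "overcome" Amdahl's law.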
Speedup that assumes scaling of the input is not the same as true speedup, and reporting it as if it were is misleading. Since parallel benchmarks are often run on different-sized multiprocessors, it is important to specify what type of application scaling is permissible and how that
scaling should be done. Although simply scaling the data size with processor count is rarely
appropriate, assuming a fixed problem size for a much larger processor count (called strong scaling) is often inappropriate, as well, since it is likely that users given a much larger multiprocessor would opt to run a larger or more detailed version of an application. See Appendix
I for more discussion on this important topic.
Fallacy: Linear Speedups Are Needed to Make Multiprocessors Cost-Effective
It is widely recognized that one of the major benefits of parallel computing is to offer a “shorter
time to solution” than the fastest uniprocessor. Many people, however, also hold the view that
parallel processors cannot be as cost-effective as uniprocessors unless they can achieve perfect linear speedup. This argument says that, because the cost of the multiprocessor is a linear function of the number of processors, anything less than linear speedup means that the performance/cost ratio decreases, making a parallel processor less cost-effective than using a uniprocessor.
The problem with this argument is that cost is not only a function of processor count but
also depends on memory, I/O, and the overhead of the system (box, power supply, interconnect, and so on). It also makes less sense in the multicore era, when there are multiple processors per chip.
The effect of including memory in the system cost was pointed out by Wood and Hill [1995].
We use an example based on more recent data using TPC-C and SPECRate benchmarks, but
the argument could also be made with a parallel scientific application workload, which would
likely make the case even stronger.
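As a minimal sketch of the underlying comparison (with made-up costs rather than real pricing): Wood and Hill compare speedup against "costup," the cost of the n-processor system relative to the single-processor system. Once a fixed base cost for memory, I/O, and packaging is included, costup grows more slowly than linearly, so a sublinear speedup can still win on performance per dollar:

# Sketch: cost-effectiveness with a fixed base system cost (illustrative numbers).
# A multiprocessor improves performance per dollar whenever its speedup
# exceeds its costup relative to the one-processor configuration.

BASE_COST = 20000   # assumed: box, power supply, interconnect, memory, I/O
PER_PROC = 1000     # assumed: incremental cost per processor

def costup(n):
    return (BASE_COST + n * PER_PROC) / (BASE_COST + PER_PROC)

n, speedup = 64, 40          # suppose 64 processors give only a 40x speedup
print(f"costup({n}) = {costup(n):.1f}, speedup = {speedup}")
# costup(64) = 4.0, so even this sublinear speedup delivers roughly 10x the
# performance per dollar of the uniprocessor.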
Figure 5.32 shows the speedup for TPC-C, SPECintRate, and SPECfpRate on an IBM eServer
p5 multiprocessor configured with 4 to 64 processors. The figure shows that only TPC-C
achieves better than linear speedup. For SPECintRate and SPECfpRate, speedup is less than linear.