Of course, bandwidth is not the only issue. Adding CPUs to the bus does not
increase the diameter of the interconnection network or latency in the absence of
traffic, whereas adding them to the grid does. For an n × n grid, the
diameter is 2(n - 1), so the worst (and average) case latency increases
roughly as the square
root of the number of CPUs. For 400 CPUs, the diameter is 38, whereas for 1600
CPUs it is 78, so quadrupling the number of CPUs approximately doubles the
diameter and thus the average latency.
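
To put numbers on this, the short C program below (an illustrative sketch, not
part of the original discussion) simply evaluates the 2(n - 1) formula for a
square grid of p CPUs, taking n as the square root of p; it reproduces the
38-hop and 78-hop figures quoted above and shows the square-root trend
continuing at 6400 CPUs.

#include <math.h>
#include <stdio.h>

/* Diameter of an n x n grid: the longest shortest path runs between
   opposite corners and takes 2 * (n - 1) hops. */
static int grid_diameter(int cpus)
{
    int n = (int)(sqrt((double)cpus) + 0.5);  /* assumes cpus is a perfect square */
    return 2 * (n - 1);
}

int main(void)
{
    int sizes[] = { 400, 1600, 6400 };
    for (int i = 0; i < 3; i++)
        printf("%4d CPUs -> diameter %3d hops\n", sizes[i], grid_diameter(sizes[i]));
    return 0;
}
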
Ideally, a scalable system should maintain the same average bandwidth per
CPU and a constant average latency as more and more CPUs are added. In
practice, however, keeping enough bandwidth per CPU is doable, but in all
practical designs, latency grows with size. Having it grow logarithmically, as
in a hypercube, is about the best that can be done.
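
For comparison (again only a sketch, and assuming the number of CPUs is a
power of two), the fragment below computes the hypercube diameter of log2 p
hops for p CPUs; here quadrupling the number of CPUs adds just two hops
instead of doubling the diameter as in the grid.

#include <stdio.h>

/* A hypercube with p = 2^d CPUs has diameter d = log2(p) hops. */
static int hypercube_diameter(int cpus)
{
    int d = 0;
    while ((1 << d) < cpus)
        d++;
    return d;
}

int main(void)
{
    for (int p = 64; p <= 4096; p *= 4)
        printf("%5d CPUs -> diameter %2d hops\n", p, hypercube_diameter(p));
    return 0;
}
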
The problem with having latency grow as the system scales up is that latency is
often fatal to performance in fine- and medium-grained applications. If a program
needs data that are not in its local memory, there is often a substantial
delay in getting them, and the bigger the system, the longer the delay, as we
have just seen. This problem is just as true of multiprocessors as of
multicomputers, since in both cases the physical memory is invariably divided
up into far-flung modules.
As a consequence of this observation, system designers often go to great
lengths to reduce, or at least hide, the latency, using several techniques we will now
mention. The first latency-hiding technique is data replication. If copies of a
block of data can be kept at multiple locations, accesses from those locations can
be speeded up. One such replication technique is caching, in which one or more
copies of data blocks are kept close to where they are being used, as well as where
they "belong." However, another strategy is to maintain multiple peer copies
(copies that have equal status), as opposed to the asymmetric primary/secondary
relationship used in caching. When multiple copies are maintained, in whatever
form, key issues are where the data blocks are placed, when, and by whom.
Answers range from dynamic placement on demand by the hardware, to intentional
placement at load time following compiler directives. In all cases, managing
consistency is an issue.
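
As a schematic illustration of the bookkeeping replication implies (this is a
hypothetical fragment, not a real coherence protocol from the text), the C
code below records which nodes hold a copy of a block and invalidates the
other copies on every write, which is the consistency problem in its simplest
form.

#include <stdbool.h>
#include <string.h>

#define MAX_NODES 64

/* One replicated data block: the node it "belongs" to, plus a flag per
   node recording whether that node currently holds a copy. */
struct block {
    int  home;
    bool copy_at[MAX_NODES];
    int  value;
};

/* A write from one node: update the value and invalidate every other
   copy, so that stale data are never read. This per-write work is what
   managing consistency costs. */
static void write_block(struct block *b, int writer, int new_value)
{
    b->value = new_value;
    for (int n = 0; n < MAX_NODES; n++)
        b->copy_at[n] = (n == writer);
}

int main(void)
{
    struct block b;
    memset(&b, 0, sizeof(b));
    b.home = 0;
    b.copy_at[0] = b.copy_at[3] = true;  /* two peer copies exist         */
    write_block(&b, 3, 42);              /* node 3 writes; the copy at    */
    return 0;                            /* node 0 is invalidated         */
}
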
A second technique for hiding latency is prefetching. If a data item can be
fetched before it is needed, the fetching process can be overlapped with
normal execution, so that when the item is needed, it will be there.
Prefetching can be automatic or under program control. When a cache loads not
only the word being referenced, but an entire cache line containing the word,
it is gambling that the succeeding words are also likely to be needed soon.
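
The gamble usually pays off for sequential access and fails for scattered
access. The toy benchmark below (an illustration only; the 64-byte line size
and the timings are machine dependent) sums an array once with unit stride
and once touching only every 16th element. The second loop does one sixteenth
of the additions but pulls in just as many cache lines, so on most machines
it takes a comparable amount of time.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                /* 16M ints, far larger than any cache */

int main(void)
{
    int *a = malloc(N * sizeof *a);
    volatile long sum = 0;
    clock_t t;

    if (a == NULL)
        return 1;
    for (int i = 0; i < N; i++)    /* touch every page once up front */
        a[i] = i;

    /* Unit stride: each cache-line fill brings in words that are used on
       the very next iterations, so the hardware's gamble pays off. */
    t = clock();
    for (int i = 0; i < N; i++)
        sum += a[i];
    printf("unit stride: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    /* Stride of 16 ints (one 64-byte line per access on typical machines):
       a sixteenth of the work, but the same number of line fills. */
    t = clock();
    for (int i = 0; i < N; i += 16)
        sum += a[i];
    printf("stride 16:   %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    free(a);
    return 0;
}
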
Prefetching can also be controlled explicitly. When the compiler realizes that
it will need some data, it can put in an explicit instruction to go get them, and put
that instruction sufficiently far in advance that the data will be there in time. This
strategy requires that the compiler have complete knowledge of the underlying
machine and its timing, as well as control over where all data are placed.
Such speculative LOAD instructions work best when it is known for sure that
the data will be needed.
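
Under program control, an explicit prefetch might look like the sketch below.
It uses the __builtin_prefetch intrinsic available in GCC and Clang to start
fetching an element a fixed distance ahead of where the loop currently is;
the scrambled access pattern, the distance of 64 iterations, and the array
size are arbitrary choices for illustration, and the right distance depends
on the memory latency of the machine.

#include <stdio.h>
#include <stdlib.h>

#define N        (1 << 22)
#define DISTANCE 64               /* how far ahead to prefetch; machine dependent */

int main(void)
{
    long  sum = 0;
    int  *a   = malloc(N * sizeof *a);
    int  *idx = malloc(N * sizeof *idx);

    if (a == NULL || idx == NULL)
        return 1;
    for (int i = 0; i < N; i++) {              /* a scrambled access pattern that  */
        a[i]   = i;                            /* hardware line fill alone cannot  */
        idx[i] = (int)((i * 2654435761u) % N); /* turn into useful prefetches      */
    }

    for (int i = 0; i < N; i++) {
        if (i + DISTANCE < N)
            /* Start fetching a future element now, so the memory access
               overlaps with the work of the next DISTANCE iterations. */
            __builtin_prefetch(&a[idx[i + DISTANCE]], 0, 1);
        sum += a[idx[i]];
    }

    printf("%ld\n", sum);
    free(a);
    free(idx);
    return 0;
}

Whether the prefetch actually wins anything can only be settled by measuring:
issued too late it hides nothing, and issued too early the line may be evicted
again before it is used.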
 