Of course, bandwidth is not the only issue. Adding CPUs to the bus does not
increase the diameter of the interconnection network or latency in the absence of
traffic, whereas adding them to the grid does. For an n × n grid, the
diameter is 2(n - 1), so the worst (and average) case latency increases
roughly as the square
root of the number of CPUs. For 400 CPUs, the diameter is 38, whereas for 1600
CPUs it is 78, so quadrupling the number of CPUs approximately doubles the
diameter and thus the average latency.
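
To put numbers on this, the short C program below (an illustrative sketch, not
part of the original discussion) simply evaluates the 2(n - 1) formula for a
square grid of p CPUs, taking n as the square root of p; it reproduces the
38-hop and 78-hop figures quoted above and shows the square-root trend
continuing at 6400 CPUs.

#include <math.h>
#include <stdio.h>

/* Diameter of an n x n grid: the longest shortest path runs between
   opposite corners and takes 2 * (n - 1) hops. */
static int grid_diameter(int cpus)
{
    int n = (int)(sqrt((double)cpus) + 0.5);  /* assumes cpus is a perfect square */
    return 2 * (n - 1);
}

int main(void)
{
    int sizes[] = { 400, 1600, 6400 };
    for (int i = 0; i < 3; i++)
        printf("%4d CPUs -> diameter %3d hops\n", sizes[i], grid_diameter(sizes[i]));
    return 0;
}
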
Ideally, a scalable system should maintain the same average bandwidth per
CPU and a constant average latency as more and more CPUs are added. In
practice, however, keeping enough bandwidth per CPU is doable, but in all
practical designs, latency grows with size. Having it grow logarithmically, as
in a hypercube, is about the best that can be done.
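
For comparison (again only a sketch, and assuming the number of CPUs is a
power of two), the fragment below computes the hypercube diameter of log2 p
hops for p CPUs; here quadrupling the number of CPUs adds just two hops
instead of doubling the diameter as in the grid.

#include <stdio.h>

/* A hypercube with p = 2^d CPUs has diameter d = log2(p) hops. */
static int hypercube_diameter(int cpus)
{
    int d = 0;
    while ((1 << d) < cpus)
        d++;
    return d;
}

int main(void)
{
    for (int p = 64; p <= 4096; p *= 4)
        printf("%5d CPUs -> diameter %2d hops\n", p, hypercube_diameter(p));
    return 0;
}
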
The problem with having latency grow as the system scales up is that latency is
often fatal to performance in fine- and medium-grained applications. If a program
needs data that are not in its local memory, there is often a substantial
delay in getting them, and the bigger the system, the longer the delay, as we
have just seen. This problem is just as true of multiprocessors as of
multicomputers, since in both cases the physical memory is invariably divided
up into far-flung modules.
As a consequence of this observation, system designers often go to great
lengths to reduce, or at least hide, the latency, using several techniques we will now
mention. The first latency-hiding technique is data replication. If copies of a
block of data can be kept at multiple locations, accesses from those locations can
be speeded up. One such replication technique is caching, in which one or more
copies of data blocks are kept close to where they are being used, as well as where
they "belong." However, another strategy is to maintain multiple peer copies
(copies that have equal status), as opposed to the asymmetric primary/secondary
relationship used in caching. When multiple copies are maintained, in whatever
form, key issues are where the data blocks are placed, when, and by whom.
Answers range from dynamic placement on demand by the hardware, to intentional
placement at load time following compiler directives. In all cases, managing
consistency is an issue.
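
As a schematic illustration of the bookkeeping replication implies (this is a
hypothetical fragment, not a real coherence protocol from the text), the C
code below records which nodes hold a copy of a block and invalidates the
other copies on every write, which is the consistency problem in its simplest
form.

#include <stdbool.h>
#include <string.h>

#define MAX_NODES 64

/* One replicated data block: the node it "belongs" to, plus a flag per
   node recording whether that node currently holds a copy. */
struct block {
    int  home;
    bool copy_at[MAX_NODES];
    int  value;
};

/* A write from one node: update the value and invalidate every other
   copy, so that stale data are never read. This per-write work is what
   managing consistency costs. */
static void write_block(struct block *b, int writer, int new_value)
{
    b->value = new_value;
    for (int n = 0; n < MAX_NODES; n++)
        b->copy_at[n] = (n == writer);
}

int main(void)
{
    struct block b;
    memset(&b, 0, sizeof(b));
    b.home = 0;
    b.copy_at[0] = b.copy_at[3] = true;  /* two peer copies exist         */
    write_block(&b, 3, 42);              /* node 3 writes; the copy at    */
    return 0;                            /* node 0 is invalidated         */
}
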
A second technique for hiding latency is prefetching. If a data item can be
fetched before it is needed, the fetching process can be overlapped with
normal execution, so that when the item is needed, it will be there.
Prefetching can be automatic or under program control. When a cache loads not
only the word being referenced, but an entire cache line containing the word,
it is gambling that the succeeding words are also likely to be needed soon.
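
The gamble usually pays off for sequential access and fails for scattered
access. The toy benchmark below (an illustration only; the 64-byte line size
and the timings are machine dependent) sums an array once with unit stride
and once touching only every 16th element. The second loop does one sixteenth
of the additions but pulls in just as many cache lines, so on most machines
it takes a comparable amount of time.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                /* 16M ints, far larger than any cache */

int main(void)
{
    int *a = malloc(N * sizeof *a);
    volatile long sum = 0;
    clock_t t;

    if (a == NULL)
        return 1;
    for (int i = 0; i < N; i++)    /* touch every page once up front */
        a[i] = i;

    /* Unit stride: each cache-line fill brings in words that are used on
       the very next iterations, so the hardware's gamble pays off. */
    t = clock();
    for (int i = 0; i < N; i++)
        sum += a[i];
    printf("unit stride: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    /* Stride of 16 ints (one 64-byte line per access on typical machines):
       a sixteenth of the work, but the same number of line fills. */
    t = clock();
    for (int i = 0; i < N; i += 16)
        sum += a[i];
    printf("stride 16:   %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    free(a);
    return 0;
}
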
Prefetching can also be controlled explicitly. When the compiler realizes that
it will need some data, it can put in an explicit instruction to go get them, and put
that instruction sufficiently far in advance that the data will be there in time. This
strategy requires that the compiler have complete knowledge of the underlying
machine and its timing, as well as control over where all data are placed.
Such speculative LOAD instructions work best when it is known for sure that
the data will be needed.
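
Under program control, an explicit prefetch might look like the sketch below.
It uses the __builtin_prefetch intrinsic available in GCC and Clang to start
fetching an element a fixed distance ahead of where the loop currently is;
the scrambled access pattern, the distance of 64 iterations, and the array
size are arbitrary choices for illustration, and the right distance depends
on the memory latency of the machine.

#include <stdio.h>
#include <stdlib.h>

#define N        (1 << 22)
#define DISTANCE 64               /* how far ahead to prefetch; machine dependent */

int main(void)
{
    long  sum = 0;
    int  *a   = malloc(N * sizeof *a);
    int  *idx = malloc(N * sizeof *idx);

    if (a == NULL || idx == NULL)
        return 1;
    for (int i = 0; i < N; i++) {              /* a scrambled access pattern that  */
        a[i]   = i;                            /* hardware line fill alone cannot  */
        idx[i] = (int)((i * 2654435761u) % N); /* turn into useful prefetches      */
    }

    for (int i = 0; i < N; i++) {
        if (i + DISTANCE < N)
            /* Start fetching a future element now, so the memory access
               overlaps with the work of the next DISTANCE iterations. */
            __builtin_prefetch(&a[idx[i + DISTANCE]], 0, 1);
        sum += a[idx[i]];
    }

    printf("%ld\n", sum);
    free(a);
    free(idx);
    return 0;
}

Whether the prefetch actually wins anything can only be settled by measuring:
issued too late it hides nothing, and issued too early the line may be evicted
again before it is used.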
 