The speed at which the main memory system can fill cache requests is a major factor on the CPU
side of performance. It is not at all unusual for memory latency to occupy 50% of total CPU time.
Memory latency is difficult to identify as separate from CPU time because there are no standard
tools for measuring the amount of time it takes. As far as the OS is concerned, the entire
CPU/cache system is a single entity and is lumped into a single number--CPU time.
No measurements of cache activity are recorded, so the only means of distinguishing cache from
CPU are (1) counting instructions, (2) comparing target code to known code, and (3) using
simulators. Simulators are not generally available. We'll focus on (1) and (2). Once we
determine the cache behavior of our program, we may be able to reorganize data access to
improve performance.
Simulators are too complex to use easily, so there's no reasonable way for vendors to market them. If
you are willing to go through a lot of pain and spend big bucks for one, tell your vendor. Vendors will
do anything for money.
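Technique (2), comparing target code to known code, can be sketched roughly as follows. This is only an illustration under stated assumptions, not a measurement tool: the function names `timed_sum` and `compare_access_orders` are made up for this example, and because Python's interpreter overhead hides much of the memory latency, the gap it reports will be far smaller than what compiled code would show. The idea is to run identical work twice, once walking memory in order and once in a shuffled order, and attribute the difference in time to the memory system.

```python
import random
import time
from array import array


def timed_sum(data, order):
    """Sum data[i] for each i in order; return (total, elapsed seconds)."""
    start = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return total, time.perf_counter() - start


def compare_access_orders(n=1 << 20, seed=0):
    """Hypothetical helper: time the same sum with two access orders."""
    data = array("q", range(n))        # n 8-byte integers, contiguous in memory
    sequential = list(range(n))        # the "known" code: walk memory in order
    scattered = sequential[:]          # same indices, cache-hostile order
    random.Random(seed).shuffle(scattered)

    seq_total, seq_time = timed_sum(data, sequential)
    rnd_total, rnd_time = timed_sum(data, scattered)
    if seq_total != rnd_total:         # identical work; only the order differs
        raise RuntimeError("the two passes did different work")
    return seq_time, rnd_time
```

Calling `compare_access_orders()` returns the two elapsed times. If the scattered pass is noticeably slower, the extra time is a rough indication of how much the program is paying for cache misses.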
No single CPU can come vaguely close to saturating a main memory bus. At the insane rate of one
memory access per cycle, a 200-MHz Ultra could demand nearly 100 MB/s--one-twelfth of the
UPA bus's bandwidth. Of course, the CPU wouldn't have any time to do anything. Realistic
programs demand data rates closer to 50 MB/s, and 95% or more of that is serviced by the cache.
Main memory bus rates of 5 MB/s per CPU are normal for actual programs. A UPA bus can
sustain data rates of over 1 GB/s.
It is true that a maximally configured E10000 with 64 CPUs can easily saturate the 100-MHz
UPA crossbar switch. We don't have any clever techniques for minimizing that bus traffic.
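The arithmetic behind these figures can be worked through in a few lines. This is a back-of-the-envelope sketch using only the numbers quoted above. The 1200 MB/s bus figure is an assumption derived from "100 MB/s is one-twelfth of the bus's bandwidth", and the function names are made up for the example.

```python
PEAK_DEMAND_MB_S = 100.0        # one access per cycle on a 200-MHz CPU (from the text)
REALISTIC_DEMAND_MB_S = 50.0    # what realistic programs ask for (from the text)
CACHE_HIT_FRACTION = 0.95       # share of that demand serviced by the cache
TYPICAL_BUS_RATE_MB_S = 5.0     # normal main memory bus rate per CPU (from the text)
BUS_BANDWIDTH_MB_S = 12 * PEAK_DEMAND_MB_S   # assumption: 100 MB/s is one-twelfth


def main_memory_traffic(demand_mb_s, hit_fraction):
    """Traffic that misses the cache and must cross the memory bus."""
    return demand_mb_s * (1.0 - hit_fraction)


def cpus_to_saturate(bus_mb_s, per_cpu_mb_s):
    """How many CPUs at a given per-CPU rate it takes to fill the bus."""
    return bus_mb_s / per_cpu_mb_s


miss_traffic = main_memory_traffic(REALISTIC_DEMAND_MB_S, CACHE_HIT_FRACTION)
print(f"bus traffic at a 95% hit rate: {miss_traffic:.1f} MB/s per CPU")
print(f"CPUs to saturate at the typical rate: "
      f"{cpus_to_saturate(BUS_BANDWIDTH_MB_S, TYPICAL_BUS_RATE_MB_S):.0f}")
print(f"CPUs to saturate at the peak rate: "
      f"{cpus_to_saturate(BUS_BANDWIDTH_MB_S, PEAK_DEMAND_MB_S):.0f}")
```

Under these assumptions a 95% hit rate leaves about 2.5 MB/s per CPU on the bus, the same order of magnitude as the 5 MB/s quoted. At the typical rate it would take a couple of hundred CPUs to fill the bus, while at the peak rate a dozen would do it. That suggests a 64-CPU machine saturates the bus only when its programs run well above the typical rate.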
Making a disk request takes a long time, about 20 ms. During this time a thread will typically go
to sleep, letting others run. Depending upon the details of the access pattern, there are a couple of
things we can do either to reduce the number of requests or to pipeline them. When the working
set is just a bit larger than main memory, we can simply buy more memory.
When the working set is enormous, we can duplicate the techniques that we'll use for optimizing
memory access. Disk accesses are easier to deal with than cache
misses because the OS does collect statistics on them and because the CPU is able to run other
threads while waiting.
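One way to pipeline disk requests is to keep the next read in flight while the program works on the current block. The sketch below is a minimal illustration, not a tuned implementation: the name `read_blocks_pipelined` and the default block size are assumptions made for this example. It uses a single helper thread so that reads stay in order and only one request is outstanding at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def read_blocks_pipelined(path, block_size=1 << 20):
    """Yield successive blocks of a file, keeping one read in flight."""
    with open(path, "rb") as f, ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(f.read, block_size)      # start the first request
        while True:
            block = pending.result()                   # wait for the outstanding read
            if not block:                              # empty read means end of file
                break
            pending = pool.submit(f.read, block_size)  # request the next block now...
            yield block                                # ...while the caller uses this one
```

A caller simply loops, for example `for block in read_blocks_pipelined("data.bin"): process(block)`, where `data.bin` and `process` are placeholders. While `process` runs, the helper thread is asleep in the next read, so part of the 20 ms wait overlaps with useful work. Choosing a larger `block_size` attacks the other half of the problem by reducing the number of requests.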
Other types of I/O must simply be endured. There really is no way to optimize for asynchronous
Sometimes one CPU will hold a lock that another CPU needs. This is normal and unavoidable, but
it may be possible to reduce the frequency. In some programs, contention can be a major factor in
reducing the amount of parallelism achieved. Contention is an issue only for multithreaded (or
multiprocess) programs, and primarily on MP machines. Although threaded programs on
uniprocessors do experience contention, the most important cause of the contention is the speed of
other components of the system (e.g., you're holding a lock, waiting for the disk to spin).
Reducing contention is always a good thing, and is often worth a lot of extra work.
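A common way to reduce how often a lock is taken is to accumulate results privately and touch the shared data once per batch. The sketch below is illustrative only: `CountingLock` and `run_workers` are hypothetical helpers invented for this example, and the "work" is just counting. It runs the same job both ways and reports how many times the lock was acquired.

```python
import threading


class CountingLock:
    """A mutex that records how many times it was acquired (illustrative helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1      # safe: updated while holding the lock
        return self

    def __exit__(self, *exc_info):
        self._lock.release()
        return False


def run_workers(n_threads, items_per_thread, batched):
    """Return (shared total, number of lock acquisitions)."""
    lock = CountingLock()
    total = [0]                     # shared state protected by the lock

    def worker():
        if batched:
            local = 0
            for _ in range(items_per_thread):
                local += 1          # private work, no lock needed
            with lock:              # one acquisition per thread
                total[0] += local
        else:
            for _ in range(items_per_thread):
                with lock:          # one acquisition per item
                    total[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0], lock.acquisitions
```

With 4 threads and 1000 items each, `run_workers(4, 1000, batched=False)` returns `(4000, 4000)` and `run_workers(4, 1000, batched=True)` returns `(4000, 4)`: the same answer with a thousandth of the opportunities for one thread to block another.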