Memory Latency - Memory Bandwidth - I/O Latency - Contention - Multithreaded Programming with JAVA

Memory Latency

The speed at which the main memory system can fill cache requests is a major factor on the CPU
side of performance. It is not at all unusual for memory latency to occupy 50% of total CPU time.
Memory latency is difficult to identify as separate from CPU time because there are no standard
tools for measuring the amount of time it takes. As far as the OS is concerned, the entire
CPU/cache system is a single entity and is lumped into a single number: CPU time.

No measurements of cache activity are recorded, so the only means of distinguishing cache from
CPU are (1) counting instructions, (2) comparing target code to known code, and (3) using
simulators. Simulators are not generally available.[4] We'll focus on (1) and (2). Once we
determine the cache behavior of our program, we may be able to reorganize data access to
improve performance (see Reducing Cache Misses).
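One rough way to apply approaches (1) and (2) is to time two loops that execute essentially the same instructions but touch memory in a different order; any difference in elapsed time then has to come from the memory system rather than from the CPU. The following is only a sketch under that assumption. The class name, the method names, and the matrix size are illustrative and do not come from the text, and the timings it prints will vary with the JVM and the hardware.

```java
// A sketch, not a benchmark harness: two traversals with the same
// arithmetic work but different memory access order.
final class CacheTraversalSketch {

    // Builds an n-by-n matrix where element (i, j) holds i + j.
    static int[][] fill(int n) {
        int[][] m = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                m[i][j] = i + j;
            }
        }
        return m;
    }

    // Row by row: consecutive accesses fall within the same row array.
    static long sumRowMajor(int[][] m) {
        long sum = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                sum += m[i][j];
            }
        }
        return sum;
    }

    // Column by column: each access moves to a different row array.
    // Assumes a rectangular, non-empty matrix.
    static long sumColumnMajor(int[][] m) {
        long sum = 0;
        int cols = m[0].length;
        for (int j = 0; j < cols; j++) {
            for (int i = 0; i < m.length; i++) {
                sum += m[i][j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] m = fill(2048); // illustrative size, chosen to exceed typical caches
        // Repeat a few times so that the JIT compiler has warmed up both loops.
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            long a = sumRowMajor(m);
            long t1 = System.nanoTime();
            long b = sumColumnMajor(m);
            long t2 = System.nanoTime();
            System.out.printf("row-major %d us, column-major %d us (sums %d, %d)%n",
                    (t1 - t0) / 1000, (t2 - t1) / 1000, a, b);
        }
    }
}
```

Both methods return the same sum, so the comparison isolates the access pattern. If one ordering is consistently slower, that is a hint (not a proof) that cache misses account for the gap.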

[4] They're too complex to use easily, so there's no reasonable way for vendors to market them. If
you are willing to go through a lot of pain and spend big bucks for one, tell your vendor. Vendors will
do anything for money.

Memory Bandwidth

No single CPU can come even close to saturating a main memory bus. At the insane rate of one
memory access per cycle, a 200-MHz Ultra could demand nearly 100 MB/s, one-twelfth of the
UPA bus's bandwidth. Of course, the CPU would then have no time to do anything else. Realistic
programs demand data rates closer to 50 MB/s, and 95% or more of that is serviced by the cache.
Main memory bus rates of 5 MB/s per CPU are normal for actual programs. A UPA bus can
sustain data rates of over 1 GB/s.
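The arithmetic behind these figures can be sketched in a few lines. The model here is an assumption made for illustration: main memory traffic is taken to be the CPU's demand multiplied by the cache miss fraction, and 1 GB/s is treated as 1000 MB/s. The class and method names are hypothetical.

```java
// A back-of-the-envelope sketch using the figures quoted in the text.
final class BandwidthEstimateSketch {

    // Traffic that reaches main memory, assuming every cache miss
    // goes to memory and every hit does not.
    static double mainMemoryRate(double demandMBps, double cacheHitRate) {
        return demandMBps * (1.0 - cacheHitRate);
    }

    // Fraction of the bus that one CPU's memory traffic would occupy.
    static double busShare(double rateMBps, double busMBps) {
        return rateMBps / busMBps;
    }

    public static void main(String[] args) {
        double demand = 50.0;   // MB/s, "realistic programs" figure from the text
        double hitRate = 0.95;  // "95% or more" serviced by the cache
        double bus = 1000.0;    // MB/s, assuming 1 GB/s = 1000 MB/s

        double toMemory = mainMemoryRate(demand, hitRate);
        System.out.printf("Main memory traffic: about %.1f MB/s%n", toMemory);
        System.out.printf("Share of bus at 5 MB/s per CPU: %.1f%%%n",
                100.0 * busShare(5.0, bus));
    }
}
```

Under this simple model a 95% hit rate leaves about 2.5 MB/s for main memory, which is the same order of magnitude as the 5 MB/s per CPU quoted above, and either figure is a small fraction of what the bus can sustain.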

It is true that a maximally configured ES10000 with 64 CPUs can easily saturate the 100-MHz
UPA crossbar switch. We don't have any clever techniques for minimizing that bus traffic.

I/O Latency

Making a disk request takes a long time, about 20 ms. During this time a thread will typically go
to sleep, letting others run. Depending upon the details of the access pattern, there are a couple of
things we can do either to reduce the number of requests or to pipeline them. When the working
set is just a bit larger than main memory, we can simply buy more memory.
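Pipelining requests means having several of them outstanding at once, so that their latencies overlap instead of adding up. The sketch below illustrates the idea with a thread pool. It makes loud assumptions: readBlock is a hypothetical stand-in that merely sleeps for 20 ms to imitate a disk request, and the class name, pool size, and returned values are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// A sketch of overlapping several slow requests; not real disk I/O.
final class PipelinedReadsSketch {
    static final long SIMULATED_LATENCY_MS = 20; // the text's rough disk latency

    // Hypothetical blocking read: sleeps to imitate the device, then
    // returns a made-up value derived from the block number.
    static int readBlock(int blockNumber) {
        try {
            Thread.sleep(SIMULATED_LATENCY_MS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("read interrupted", e);
        }
        return blockNumber * 2;
    }

    // Issues all requests up front, then collects the results in order.
    static List<Integer> readAll(int blockCount, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int b = 0; b < blockCount; b++) {
                final int block = b;
                pending.add(pool.submit(() -> readBlock(block)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pending) {
                results.add(f.get()); // blocks until that request completes
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("read failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        List<Integer> data = readAll(8, 8);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(data.size() + " blocks in about " + elapsedMs + " ms");
    }
}
```

With one thread per outstanding request, the waits overlap, so the total time should be closer to one request's latency than to the sum of all of them. Whether a real disk benefits to the same degree depends on the device and the access pattern.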

When the working set is enormous, we can duplicate the techniques that we'll use for optimizing
memory access (see Reducing Cache Misses). Disk accesses are easier to deal with than cache
misses because the OS does collect statistics on them and because the CPU is able to run other
threads while waiting.

Other types of I/O must simply be endured. There really is no way to optimize for asynchronous
network requests.

Contention

Sometimes one CPU will hold a lock that another CPU needs. This is normal and unavoidable, but
it may be possible to reduce the frequency. In some programs, contention can be a major factor in
reducing the amount of parallelism achieved. Contention is only an issue for multithreaded (or
multiprocess) programs, and primarily only on MP machines. Although threaded programs on
uniprocessors do experience contention, the most important cause of the contention is the speed of
other components of the system (e.g., you're holding a lock, waiting for the disk to spin).

Reducing contention is always a good thing, and is often worth a lot of extra work.
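One common way to reduce contention is to hold a lock only while touching shared data and to do any slow work before acquiring it. The sketch below contrasts the two styles. It is an illustration under stated assumptions, not a prescription: slowCompute is a hypothetical operation that sleeps for 1 ms to stand in for something slow such as a disk wait, and all class and method names are invented for the example.

```java
// A sketch contrasting a long critical section with a short one.
final class ShortCriticalSectionSketch {
    private final Object lock = new Object();
    private long sum = 0; // shared state, guarded by lock

    // Hypothetical slow operation: sleeps briefly, then returns x squared.
    static long slowCompute(int x) {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return (long) x * x;
    }

    // Contended style: the lock is held for the whole slow operation,
    // so other threads queue up behind it.
    void addHoldingLock(int x) {
        synchronized (lock) {
            sum += slowCompute(x);
        }
    }

    // Reduced contention: slow work happens outside the lock, which is
    // then held only for the update of the shared sum.
    void addBriefly(int x) {
        long value = slowCompute(x);
        synchronized (lock) {
            sum += value;
        }
    }

    long total() {
        synchronized (lock) {
            return sum;
        }
    }

    // Runs several threads against one shared instance and returns the total.
    static long runWorkers(int threads, int perThread, boolean brief) {
        ShortCriticalSectionSketch shared = new ShortCriticalSectionSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int x = 0; x < perThread; x++) {
                    if (brief) {
                        shared.addBriefly(x);
                    } else {
                        shared.addHoldingLock(x);
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return shared.total();
    }

    public static void main(String[] args) {
        for (boolean brief : new boolean[] {false, true}) {
            long start = System.nanoTime();
            long result = runWorkers(4, 100, brief);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println((brief ? "short" : "long") + " critical section: total "
                    + result + " in about " + ms + " ms");
        }
    }
}
```

Both styles produce the same total; only the time spent waiting for the lock differs. The short critical section lets the slow operations of different threads overlap, which is the kind of extra work the text suggests is often worthwhile.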
