Actual data transfers are accomplished via DMA from/to disks and the network. The data is
brought into the CPU only to perform checksums; it is never written by the CPU. Checksums have
horrible data locality--they load lots of data, but use that data only once, and only for a single
addition. This means that the CPU will spend an inordinate amount of time stalled, waiting for
cache loads, but that it will do virtually no writes. (Some folks are building checksumming
hardware for exactly this purpose.) Normal programs spend more time using the data once loaded
into cache, do more writes, and generally spend less time stalled on cache misses.
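The access pattern described above can be sketched in a few lines. This is a hypothetical, simplified checksum loop (not Sun's actual NFS checksum code): each word is loaded exactly once, used for a single addition, and never written back, which is why the CPU spends its time waiting on cache-line fills rather than doing useful work.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative checksum with the locality pattern described in the
 * text: one load and one add per word, no reuse, no stores to the
 * data. Nearly every access to a large buffer is a cache miss. */
uint64_t checksum(const uint32_t *buf, size_t nwords)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < nwords; i++)
        sum += buf[i];          /* load once, add once, move on */
    return sum;
}
```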
NFS is constructed as a producer/consumer program. The master/slave design was rejected as
being inappropriate because of the nature of interrupt handling. When a network card gets a packet,
it issues an interrupt to one of the CPUs (interrupts are distributed in a round-robin fashion on
Sun's UE series). That CPU then runs its interrupt handler thread.
For an NFS request, the interrupt handler thread acts as the producer, building an NFS request
structure and putting that onto a list. It is important for the interrupt handler thread to complete
very quickly (as other interrupts will be blocked while it's running); thus it is not possible for that
thread to do any appreciable amount of work (such as processing the request or creating a new
thread). The consumers pull requests off the queue (exactly like our P/C example) and process
them as appropriate. Sometimes the required information will be in memory, but usually a disk
request will be required. This means that most requests will require a context switch.
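The division of labor above can be sketched with a standard pthreads producer/consumer queue. The structure and function names here (`nfs_request_t`, `enqueue`, `dequeue`) are illustrative, not Sun's actual data structures; the point is how little the producer (interrupt handler) does before returning, and where the consumers block.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical request structure -- a real one would carry the
 * decoded NFS operation, credentials, buffers, etc. */
typedef struct nfs_request {
    struct nfs_request *next;
    int opcode;
} nfs_request_t;

typedef struct {
    nfs_request_t  *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
} request_queue_t;

/* Producer side: the interrupt handler builds the request, appends
 * it, signals a consumer, and returns -- nothing more, so other
 * interrupts are blocked only briefly. */
void enqueue(request_queue_t *q, nfs_request_t *r)
{
    r->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL)
        q->tail->next = r;
    else
        q->head = r;
    q->tail = r;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: worker threads block here when the queue is empty,
 * then process the request -- usually sleeping again on disk I/O,
 * which is why most requests cost a context switch. */
nfs_request_t *dequeue(request_queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)
        pthread_cond_wait(&q->nonempty, &q->lock);
    nfs_request_t *r = q->head;
    q->head = r->next;
    if (q->head == NULL)
        q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return r;
}
```

Note the `while` loop around `pthread_cond_wait()`: the condition must be rechecked after waking, exactly as in the P/C example referred to in the text.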
Many of the original algorithms used in single-threaded NFS proved to be inappropriate for a
threaded program. They worked correctly, but once appropriate locking was added they suffered
from excessive contention. A major portion of the work on multithreaded NFS was spent writing
new algorithms that would cause less contention.
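One classic example of that kind of rework (a generic illustration, not Sun's actual NFS code) is replacing a single global lock over a shared table with one lock per hash bucket. Threads working on different entries then no longer serialize on a single mutex, and contention drops roughly in proportion to the number of buckets.

```c
#include <pthread.h>
#include <stddef.h>

#define NBUCKETS 64

typedef struct entry {
    struct entry *next;
    int key;
    int value;
} entry_t;

typedef struct {
    entry_t        *head;
    pthread_mutex_t lock;   /* one lock per bucket, not per table */
} bucket_t;

static bucket_t table[NBUCKETS];

static unsigned hash(int key) { return (unsigned)key % NBUCKETS; }

void init_table(void)
{
    for (int i = 0; i < NBUCKETS; i++)
        pthread_mutex_init(&table[i].lock, NULL);
}

/* Look up an entry while holding only its bucket's lock; lookups of
 * keys that hash to other buckets proceed in parallel. */
entry_t *lookup(int key)
{
    bucket_t *b = &table[hash(key)];
    pthread_mutex_lock(&b->lock);
    entry_t *e;
    for (e = b->head; e != NULL; e = e->next)
        if (e->key == key)
            break;
    pthread_mutex_unlock(&b->lock);
    return e;
}

void insert(entry_t *e)
{
    bucket_t *b = &table[hash(e->key)];
    pthread_mutex_lock(&b->lock);
    e->next = b->head;
    b->head = e;
    pthread_mutex_unlock(&b->lock);
}
```

The correctness of the table is unchanged; only the granularity of the locking is different, which is precisely the kind of "same answer, less contention" rewrite described above.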
The results? An implementation that scales extremely well to upward of 24 CPUs.
Performance tuning is a very complex issue with numerous trade-offs to be considered. Once a
performance objective and a level of effort have been established, you can start looking at the
computer science issues. Even then, the major issues will not be threading issues. Only after you've
done a great deal of normal optimization work will you turn your eyes toward threads. We give a
cursory overview of the areas you need to consider, and wish you the best of luck.