At one of Bil's first software division meetings (back when everyone fit into the cafeteria!), there
was a big debate concerning the poor performance of SunOS on a 4-MB machine. Some of
management wanted to restrict all developers to 4-MB machines so we would be more motivated to
control code inflation. The final resolution was to make 8 MB the minimum shippable configuration.
There are three vitally important aspects of performance optimization: algorithm, algorithm, and
algorithm. Seriously. Forget all of this other stuff until you have settled on the very best possible
algorithm. We can show you programs that will run faster on a uniprocessor VAX 780 than on a
64-way, 500-MHz Alpha Server, simply due to algorithm choice.
You can multithread bubblesort, and it will run twice as fast, but...
CPU Time, I/O Time, Contention, Etc.
That should be enough moralizing on the practicalities of dealing with the real world. Now let's
get serious--you're an ISV and you really want to get the best performance you can (for some
"reasonable" programming cost). First let's look at the overall system design and define our true goals.
The primary components are the CPU, the cache, the main memory bus, main memory, the I/O
bus, and the peripherals (disks, tapes, possibly displays, networks, etc.), all of which can be
viewed generically as resources. There is a tendency to view the CPU as unique, and we often
speak of maximizing CPU usage before considering any other subsystems. However, that's not
really what we want. We really want our program to run in minimal wall-clock time. Let's
consider these subsystems.
Some programs are completely CPU-bound. They don't make great demands upon the peripherals
and have a small enough working set to be largely cache resident. A huge number of programs are
partially CPU-bound. To optimize such programs, our primary technique will be to reduce the
number of instructions executed, and our primary method of doing so will be by choosing the best
algorithm. Our secondary method will be to examine our code very carefully to see if there are places where
loops can be made tighter. Sometimes we will even examine assembly code to verify the tightness
of the compiled code. In all cases, we will first analyze our program, then focus our efforts on
those sections that consume the most time.
We will leave clever use of registers, optimal instruction scheduling, and the like to the compiler.
Only in the rarest of circumstances will we ever "bum" code (write assembly code). Byte copy can
be written in a single line of C code. On Sun machines, the actual library call occupies roughly
500 lines of carefully hand-optimized assembly code. It is specialized for each of the different
byte alignments, and a different version is written for each PSR. The programmer counts the
instructions, spreads data across numerous registers to maximize pipelining and multiple
instruction issues, etc. It runs upward of ten times as fast as the one line of C.
The chances of your doing anything similar are quite small. It takes a lot of effort, and it is valuable
for only a few very tight, very intensively used loops. The hassle of maintaining "bummed" code
is also quite significant. Don't do this at home!