Algorithm - CPU Time, I/O Time, Contention, Etc. - CPU - Multithreaded Programming with JAVA

[3]

At one of Bil's first software division meetings (back when everyone fit into the cafeteria!), there

was a big debate concerning the poor performance of SunOS on a 4-MB machine. Some of

management wanted to restrict all developers to 4-MB machines so we would be more motivated to

control code inflation. The final resolution was to make 8 MB the minimum shippable configuration.

Algorithm

There are three vitally important aspects of performance optimization: algorithm, algorithm, and

algorithm. Seriously. Forget all of this other stuff until you have settled on the very best possible

algorithm. We can show you programs that will run faster on a uniprocessor VAX 780 than on a

64-way, 500-MHz Alpha Server, simply due to algorithm choice.

You can multithread bubblesort, and it will run twice as fast, but...

CPU Time, I/O Time, Contention, Etc.

That should be enough moralizing on the practicalities of dealing with the real world. Now let's

get serious--you're an ISV and you really want to get the best performance you can (for some

"reasonable" programming cost). First let's look at the overall system design and define our true

objectives.

The primary components are the CPU, the cache, the main memory bus, main memory, the I/O

bus, and the peripherals (disks, tapes, possibly displays, networks, etc.), all of which can be

viewed generically as resources. There is a tendency to view the CPU as unique, and we often

speak of maximizing CPU usage before considering any other subsystems. However, that's not

really what we want. We really want our program to run in minimal wall-clock time. Let's

consider these subsystems.

CPU

Some programs are completely CPU-bound. They don't make great demands upon the peripherals

and have a small enough working set to be largely cache resident. A huge number of programs are

partially CPU-bound. To optimize such programs, our primary technique will be to reduce the

number of instructions executed, and our primary method of doing so will be by choosing the best

algorithms.

Our secondary method will be to examine our code very carefully to see if there are places where

loops can be made tighter. Sometimes we will even examine assembly code to verify the tightness

of the complied code. In all cases, we will first analyze our program, then focus our efforts on

those sections that consume the most time.

We will leave clever use of registers, optimal instruction scheduling, and the like to the compiler.

Only in the rarest of circumstances will we ever "bum" code (write assembly code). Byte copy can

be written in a single line of C code. On Sun machines, the actual library call occupies roughly

500 lines of carefully hand-optimized assembly code. It is specialized for each of the different

byte alignments, and a different version is written for each PSR. The programmer counts the

instructions, spreads data across numerous registers to maximize pipelining and multiple

instruction issues, etc. It runs upward of ten times as fast as the one line of C.

The chances of you doing anything similar is quite small. It takes a lot of effort, and it is valuable

for only a few very tight, very intensively used loops. The hassle of maintaining "bummed" code

is also quite significant. Don't do this at home!

Search WWH :

Custom Search