Thread creation and synchronization time is quite low (about 1.5 ms on a 110-MHz SS4),
making it reasonable to dispatch relatively small tasks to different threads. How small can that
task be? Obviously, it must be significantly larger than the thread overhead.
Something like a 10 x 10 matrix multiply (requiring about 2000 FP ops @ 100 Mflops = 20 µs)
would be much too small to thread. By contrast, a 100 x 100 matrix multiply (2M FP ops @ 100
Mflops = 20 ms) can be threaded very effectively. If you were writing a matrix routine, your code
would check the size of the matrices and run the threaded code for larger multiplies, and run the
simple multiply in the calling thread for smaller multiplies. The exact dividing point is around
3 ms of work. You can determine it empirically, and it is not terribly important to hit it exactly.
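The size check described above can be sketched in C with POSIX threads. The threshold value, the function names, and the row-slicing strategy here are illustrative choices, not from the text; the point is simply that small multiplies run inline while large ones are split across threads.

```c
/* Sketch: dispatch a matrix multiply to threads only when the matrix is
   large enough to amortize thread overhead. THRESHOLD is an assumed,
   empirically tuned cutoff (the "~3 ms of work" rule of thumb). */
#include <pthread.h>
#include <stdio.h>

#define THRESHOLD 100   /* below this, thread overhead dominates */

typedef struct {
    const double *a, *b;
    double *c;
    int n, row_lo, row_hi;
} slice_t;

/* Multiply rows [lo, hi) of an n x n product c = a * b. */
static void multiply_rows(const double *a, const double *b, double *c,
                          int n, int lo, int hi) {
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

static void *worker(void *arg) {
    slice_t *s = arg;
    multiply_rows(s->a, s->b, s->c, s->n, s->row_lo, s->row_hi);
    return NULL;
}

void matrix_multiply(const double *a, const double *b, double *c,
                     int n, int nthreads) {
    if (n < THRESHOLD || nthreads < 2) {
        multiply_rows(a, b, c, n, 0, n);   /* too small: run inline */
        return;
    }
    pthread_t tid[nthreads];
    slice_t s[nthreads];
    int chunk = (n + nthreads - 1) / nthreads;
    int started = 0;
    for (int t = 0; t < nthreads; t++) {
        int lo = t * chunk;
        int hi = lo + chunk > n ? n : lo + chunk;
        if (lo >= n) break;
        s[t] = (slice_t){ a, b, c, n, lo, hi };
        pthread_create(&tid[t], NULL, worker, &s[t]);
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
}
```

Splitting by rows keeps each worker's writes disjoint, so no locking is needed inside the multiply itself.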
One ISV we worked with was doing an EDA simulation, containing millions of 10-µs tasks. To
say the least, threading this code did not produce favorable results (it ran much slower!). They
later figured out a way of grouping the microtasks into larger tasks and threading those. The
opposite case is something like NFS, which contains hundreds of 40-ms tasks. Threading NFS
works quite well.
Dealing with Many Open Sockets
In C, C++, etc., when you want to have a large number of clients connected to your server at the
same time, you use a select() call [in Win32 it's WaitForMultipleObjects()]. This
function takes a list of file descriptors as an argument and returns when there is data ready on one
of them. This allows a single thread to wait on 1000 sockets. This is a good thing because the
overhead of having 1000 threads, each waiting on a single socket (as we've done in our programs),
would be prohibitive.
An alternative is poll(), which is now actually more common, owing to its ability to handle
very large numbers of open connections.
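The same wait expressed with poll(), which is not subject to select()'s FD_SETSIZE limit (again, the function name is illustrative):

```c
/* Wait on an arbitrary number of descriptors with poll().
   Returns the first descriptor with data, or -1 on error. */
#include <poll.h>

int wait_for_client_poll(const int *fds, int nfds) {
    struct pollfd pfds[nfds];

    for (int i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
    }
    /* -1 timeout: block until some descriptor is ready. */
    if (poll(pfds, nfds, -1) <= 0)
        return -1;
    for (int i = 0; i < nfds; i++)
        if (pfds[i].revents & POLLIN)
            return pfds[i].fd;
    return -1;
}
```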
Unfortunately, Java does not have anything similar, putting an extra constraint on the size and
scalability of your server. In Java you must have one thread devoted to each client, rendering the
producer/consumer version of a server awkward. Many of the major Java server programs actually
use JNI calls into C to make use of the select() there. There is pressure for Java to implement
something similar.
The Lessons of NFS
One practical problem in evaluating the performance of threaded programs is the lack of available
data. There are simply no good analyses of real threaded programs that we can look at. (There are
analyses of strictly computational parallel programs but not of mixed usage programs, client/
server, etc.) Nobody's done it yet! Probably the best data we have comes from NFS, which we
shall look at now.
The standard metric for evaluating NFS performance is the SPEC LADDIS benchmark, which
uses a predefined mix of file operations intended to reflect realistic usage (lots of small file
information requests, some file reads, and a few file writes). As the NFS performance goes up,
LADDIS spreads the file operations over a larger number of files on more disks to eliminate trivial
caching effects.
An NFS server is very demanding on all subsystems, and as the hardware in one area improves,
NFS performance will edge up until it hits a bottleneck in another. Figure 15-8 shows
configurations and performance results for a variety of systems. Notably, all of these systems are