Java Reference
In-Depth Information
in one pass with an internal sort. These runs can then be merged by Merge-
sort. Standard Mergesort requires eighteen passes to process 256K records.
Using an internal sort to create initial runs of 512 records reduces this to
one initial pass to create the runs and nine merge passes to put them all
together, approximately half as many passes.
We can extend this concept to improve performance even further. Available
main memory is usually much more than one block in size. If we process larger
initial runs, then the number of passes required by Mergesort is further reduced. For
example, most modern computers can provide tens or even hundreds of megabytes
of RAM to the sorting program. If all of this memory (excepting a small amount for
buffers and local variables) is devoted to building initial runs as large as possible,
then quite large files can be processed in few passes. The next section presents a
technique for producing large runs, typically twice as large as could fit directly into
main memory.
Another way to reduce the number of passes required is to increase the number
of runs that are merged together during each pass. While the standard Mergesort
algorithm merges two runs at a time, there is no reason why merging needs to be
limited in this way. Section 8.5.3 discusses the technique of multiway merging.
Over the years, many variants on external sorting have been presented, but all
are based on the following two steps:
1. Break the file into large initial runs.
2. Merge the runs together to form a single sorted file.
8.5.2
Replacement Selection
This section treats the problem of creating initial runs as large as possible from a
disk file, assuming a fixed amount of RAM is available for processing. As men-
tioned previously, a simple approach is to allocate as much RAM as possible to a
large array, fill this array from disk, and sort the array using Quicksort. Thus, if
the size of memory available for the array is M records, then the input file can be
broken into initial runs of length M. A better approach is to use an algorithm called
replacement selection that, on average, creates runs of 2M records in length. Re-
placement selection is actually a slight variation on the Heapsort algorithm. The
fact that Heapsort is slower than Quicksort is irrelevant in this context because I/O
time will dominate the total running time of any reasonable external sorting alg-
orithm. Building longer initial runs will reduce the total I/O time required.
Replacement selection views RAM as consisting of an array of size M in ad-
dition to an input buffer and an output buffer. (Additional I/O buffers might be
desirable if the operating system supports double buffering, because replacement
selection does sequential processing on both its input and its output.) Imagine that
Search WWH ::




Custom Search