File Processing and External Sorting - Data Structures and Algorithm Analysis

Java Reference

In-Depth Information

in one pass with an internal sort. These runs can then be merged by Merge-

sort. Standard Mergesort requires eighteen passes to process 256K records.

Using an internal sort to create initial runs of 512 records reduces this to

one initial pass to create the runs and nine merge passes to put them all

together, approximately half as many passes.

We can extend this concept to improve performance even further. Available

main memory is usually much more than one block in size. If we process larger

initial runs, then the number of passes required by Mergesort is further reduced. For

example, most modern computers can provide tens or even hundreds of megabytes

of RAM to the sorting program. If all of this memory (excepting a small amount for

buffers and local variables) is devoted to building initial runs as large as possible,

then quite large files can be processed in few passes. The next section presents a

technique for producing large runs, typically twice as large as could fit directly into

main memory.

Another way to reduce the number of passes required is to increase the number

of runs that are merged together during each pass. While the standard Mergesort

algorithm merges two runs at a time, there is no reason why merging needs to be

limited in this way. Section 8.5.3 discusses the technique of multiway merging.

Over the years, many variants on external sorting have been presented, but all

are based on the following two steps:

1. Break the file into large initial runs.

2. Merge the runs together to form a single sorted file.

8.5.2

Replacement Selection

This section treats the problem of creating initial runs as large as possible from a

disk file, assuming a fixed amount of RAM is available for processing. As men-

tioned previously, a simple approach is to allocate as much RAM as possible to a

large array, fill this array from disk, and sort the array using Quicksort. Thus, if

the size of memory available for the array is M records, then the input file can be

broken into initial runs of length M. A better approach is to use an algorithm called

replacement selection that, on average, creates runs of 2M records in length. Re-

placement selection is actually a slight variation on the Heapsort algorithm. The

fact that Heapsort is slower than Quicksort is irrelevant in this context because I/O

time will dominate the total running time of any reasonable external sorting alg-

orithm. Building longer initial runs will reduce the total I/O time required.

Replacement selection views RAM as consisting of an array of size M in ad-

dition to an input buffer and an output buffer. (Additional I/O buffers might be

desirable if the operating system supports double buffering, because replacement

selection does sequential processing on both its input and its output.) Imagine that

Search WWH ::

Custom Search

Home