File Processing and External Sorting - Data Structures and Algorithm Analysis

Java Reference

In-Depth Information

stored somewhere for future processing as part of another run. However, because

the heap will shrink by one element in this case, there is now a free space where the

last element of the heap used to be! Thus, replacement selection will slowly shrink

the heap and at the same time use the discarded heap space to store records for the

next run. Once the first run is complete (i.e., the heap becomes empty), the array

will be filled with records ready to be processed for the second run.

Figure 8.8

illustrates part of a run being created by replacement selection.

It should be clear that the minimum length of a run will be M records if the size

of the heap is M, because at least those records originally in the heap will be part of

the run. Under good conditions (e.g., if the input is sorted), then an arbitrarily long

run is possible. In fact, the entire file could be processed as one run. If conditions

are bad (e.g., if the input is reverse sorted), then runs of only size M result.

What is the expected length of a run generated by replacement selection? It

can be deduced from an analogy called the snowplow argument. Imagine that a

snowplow is going around a circular track during a heavy, but steady, snowstorm.

After the plow has been around at least once, snow on the track must be as follows.

Immediately behind the plow, the track is empty because it was just plowed. The

greatest level of snow on the track is immediately in front of the plow, because

this is the place least recently plowed. At any instant, there is a certain amount of

snow S on the track. Snow is constantly falling throughout the track at a steady

rate, with some snow falling “in front” of the plow and some “behind” the plow.

(On a circular track, everything is actually “in front” of the plow, but Figure 8.9

illustrates the idea.) During the next revolution of the plow, all snow S on the track

is removed, plus half of what falls. Because everything is assumed to be in steady

state, after one revolution S snow is still on the track, so 2S snow must fall during

a revolution, and 2S snow is removed during a revolution (leaving S snow behind).

At the beginning of replacement selection, nearly all values coming from the

input file are greater (i.e., “in front of the plow”) than the latest key value output for

this run, because the run's initial key values should be small. As the run progresses,

the latest key value output becomes greater and so new key values coming from the

input file are more likely to be too small (i.e., “after the plow”); such records go to

the bottom of the array. The total length of the run is expected to be twice the size of

the array. Of course, this assumes that incoming key values are evenly distributed

within the key range (in terms of the snowplow analogy, we assume that snow falls

evenly throughout the track). Sorted and reverse sorted inputs do not meet this

expectation and so change the length of the run.

8.5.3 Multiway Merging

The second stage of a typical external sorting algorithm merges the runs created by

the first stage. Assume that we have R runs to merge. If a simple two-way merge

is used, then R runs (regardless of their sizes) will require log R passes through

Search WWH ::

Custom Search

Home