Java Reference
In-Depth Information
File
Sort 1
Sort 2
Sort 3
Size
Memory size (in blocks)
Memory size (in blocks)
(Mb)
2
4
16
256
2
4
16
1
0.61
0.27
0.24
0.19
0.10
0.21
0.15
0.13
4,864
2,048
1,792
1,280
256
2,048
1,024
512
4
2.56
1.30
1.19
0.96
0.61
1.15
0.68
0.66*
21,504
10,240
9,216
7,168
3,072
10,240
5,120
2,048
16
11.28
6.12
5.63
4.78
3.36
5.42
3.19
3.10
94,208
49,152
45,056
36,864
20,480
49,152
24,516
12,288
256
220.39
132.47
123.68
110.01
86.66
115.73
69.31
68.71
1,769K
1,048K
983K
852K
589K
1,049K
524K
262K
Figure8.11 A comparison of three external sorts on a collection of small records
for files of various sizes. Each entry in the table shows time in seconds and total
number of blocks read and written by the program. File sizes are in Megabytes.
For the third sorting algorithm, on a file size of 4MB, the time and blocks shown
in the last column are for a 32-way merge (marked with an asterisk). 32 is used
instead of 16 because 32 is a root of the number of blocks in the file (while 16 is
not), thus allowing the same number of runs to be merged at every pass.
in one merge pass for a fixed-size working memory; a smaller block size or larger
working memory would increase the file size that can be processed in one merge
pass. Two merge passes allow much bigger files to be processed. With 0.5MB of
working memory and 4KB blocks, a file of size 16 gigabytes could be processed in
two merge passes, which is big enough for most applications. Thus, this is a very
effective algorithm for single disk drive external sorting.
Figure 8.11 shows a comparison of the running time to sort various-sized files
for the following implementations: (1) standard Mergesort with two input runs and
two output runs, (2) two-way Mergesort with large initial runs (limited by the size
of available memory), and (3) R-way Mergesort performed after generating large
initial runs. In each case, the file was composed of a series of four-byte records
(a two-byte key and a two-byte data value), or 256K records per megabyte of file
size. We can see from this table that using even a modest memory size (two blocks)
to create initial runs results in a tremendous savings in time. Doing 4-way merges
of the runs provides another considerable speedup, however large-scale multi-way
merges for R beyond about 4 or 8 runs does not help much because a lot of time is
spent determining which is the next smallest element among the R runs.
We see from this experiment that building large initial runs reduces the running
time to slightly more than one third that of standard Mergesort, depending on file
and memory sizes. Using a multiway merge further cuts the time nearly in half.
In summary, a good external sorting algorithm will seek to do the following:
• Make the initial runs as long as possible.
• At all stages, overlap input, processing, and output as much as possible.
 
Search WWH ::




Custom Search