File Processing and External Sorting - Data Structures and Algorithm Analysis

Java Reference

In-Depth Information

The moral is that, with a single disk drive, there often is no such thing as effi-

cient sequential processing of a data file. Thus, a sorting algorithm might be more

efficient if it performs a smaller number of non-sequential disk operations rather

than a larger number of logically sequential disk operations that require a large

number of seeks in practice.

As mentioned previously, the record size might be quite large compared to the

size of the key. For example, payroll entries for a large business might each store

hundreds of bytes of information including the name, ID, address, and job title for

each employee. The sort key might be the ID number, requiring only a few bytes.

The simplest sorting algorithm might be to process such records as a whole, reading

the entire record whenever it is processed. However, this will greatly increase the

amount of I/O required, because only a relatively few records will fit into a single

disk block. Another alternative is to do a key sort. Under this method, the keys are

all read and stored together in an index file, where each key is stored along with a

pointer indicating the position of the corresponding record in the original data file.

The key and pointer combination should be substantially smaller than the size of

the original record; thus, the index file will be much smaller than the complete data

file. The index file will then be sorted, requiring much less I/O because the index

records are smaller than the complete records.

Once the index file is sorted, it is possible to reorder the records in the original

database file. This is typically not done for two reasons. First, reading the records

in sorted order from the record file requires a random access for each record. This

can take a substantial amount of time and is only of value if the complete collection

of records needs to be viewed or processed in sorted order (as opposed to a search

for selected records). Second, database systems typically allow searches to be done

on multiple keys. For example, today's processing might be done in order of ID

numbers. Tomorrow, the boss might want information sorted by salary. Thus, there

might be no single “sorted” order for the full record. Instead, multiple index files

are often maintained, one for each sort key.

These ideas are explored further in

Chapter 10.

8.5.1

Simple Approaches to External Sorting

If your operating system supports virtual memory, the simplest “external” sort is

to read the entire file into virtual memory and run an internal sorting method such

as Quicksort. This approach allows the virtual memory manager to use its normal

buffer pool mechanism to control disk accesses. Unfortunately, this might not al-

ways be a viable option. One potential drawback is that the size of virtual memory

is usually limited to something much smaller than the disk space available. Thus,

your input file might not fit into virtual memory. Limited virtual memory can be

overcome by adapting an internal sorting method to make use of your own buffer

pool.

Search WWH ::

Custom Search

Home