File Processing and External Sorting - Data Structures and Algorithm Analysis

allocation and the smallest unit that can be read/written is a sector, which in UNIX
terminology is called a block. UNIX maintains information about file organization
in certain disk blocks called i-nodes.

A group of physically contiguous clusters from the same file is called an extent.
Ideally, all clusters making up a file will be contiguous on the disk (i.e., the file will
consist of one extent), so as to minimize seek time required to access different
portions of the file. If the disk is nearly full when a file is created, there might not
be an extent available that is large enough to hold the new file. Furthermore, if a file
grows, there might not be free space physically adjacent. Thus, a file might consist
of several extents widely spaced on the disk. The fuller the disk, and the more that
files on the disk change, the worse this file fragmentation (and the resulting seek
time) becomes. File fragmentation leads to a noticeable degradation in performance
as additional seeks are required to access data.
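
The notion of an extent can be illustrated with a minimal sketch. It assumes a file's clusters are given as an array of cluster numbers in logical file order; the class name ExtentCounter and the method countExtents are hypothetical, chosen only for this illustration. A new extent begins wherever a cluster does not immediately follow its predecessor on disk.

```java
final class ExtentCounter {
    /**
     * Counts the extents of a file whose clusters are listed in
     * logical file order (hypothetical representation). A new extent
     * starts whenever a cluster is not physically adjacent to the
     * previous one.
     */
    static int countExtents(int[] clusters) {
        if (clusters.length == 0) {
            return 0;
        }
        int extents = 1;
        for (int i = 1; i < clusters.length; i++) {
            if (clusters[i] != clusters[i - 1] + 1) {
                extents++;
            }
        }
        return extents;
    }

    public static void main(String[] args) {
        int[] contiguous = {10, 11, 12, 13};        // one extent
        int[] fragmented = {10, 11, 40, 41, 42, 7}; // three extents
        System.out.println(countExtents(contiguous));
        System.out.println(countExtents(fragmented));
    }
}
```

Under this model, each extent beyond the first means at least one additional seek when the file is read from start to end.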

Another type of problem arises when the file's logical record size does not
match the sector size. If the sector size is not a multiple of the record size (or
vice versa), records will not fit evenly within a sector. For example, a sector might
be 2048 bytes long, and a logical record 100 bytes. This leaves room to store
20 records with 48 bytes left over. Either the extra space is wasted, or else records
are allowed to cross sector boundaries. If a record crosses a sector boundary, two
disk accesses might be required to read it. If the space is left empty instead, such
wasted space is called internal fragmentation.
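
The arithmetic in the example above can be checked with a short sketch. The class SectorPacking and its method names are hypothetical. The spansBoundary method additionally assumes that records are packed back to back from byte 0 with no padding, which is the alternative in which records are allowed to cross sector boundaries.

```java
final class SectorPacking {
    /** Whole records that fit in one sector when none may cross a boundary. */
    static int recordsPerSector(int sectorBytes, int recordBytes) {
        return sectorBytes / recordBytes;
    }

    /** Bytes left unused in each sector (internal fragmentation). */
    static int wastedBytesPerSector(int sectorBytes, int recordBytes) {
        return sectorBytes % recordBytes;
    }

    /**
     * Assumes records are packed back to back starting at byte 0.
     * Returns true if the record's first and last bytes fall in
     * different sectors, so reading it may take two disk accesses.
     */
    static boolean spansBoundary(long recordIndex, int sectorBytes, int recordBytes) {
        long firstByte = recordIndex * recordBytes;
        long lastByte = firstByte + recordBytes - 1;
        return firstByte / sectorBytes != lastByte / sectorBytes;
    }

    public static void main(String[] args) {
        System.out.println(recordsPerSector(2048, 100));     // 20
        System.out.println(wastedBytesPerSector(2048, 100)); // 48
        System.out.println(spansBoundary(20, 2048, 100));    // true: bytes 2000..2099
    }
}
```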

A second example of internal fragmentation occurs at cluster boundaries. Files
whose size is not an exact multiple of the cluster size must waste some space at
the end of the last cluster. The worst case will occur when file size modulo cluster
size is one (for example, a file of 4097 bytes and a cluster of 4096 bytes, where the
second cluster holds one byte of data and wastes the other 4095). Thus,
cluster size is a tradeoff between large files processed sequentially (where a large
cluster size is desirable to minimize seeks) and small files (where small clusters are
desirable to minimize wasted storage).
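
The space wasted in the last cluster can be computed directly. The following is a minimal sketch; the class ClusterSlack and its method names are hypothetical, and it assumes a file is always allocated a whole number of clusters.

```java
final class ClusterSlack {
    /** Clusters needed to hold the file, rounding up to a whole cluster. */
    static long clustersNeeded(long fileBytes, long clusterBytes) {
        return (fileBytes + clusterBytes - 1) / clusterBytes;
    }

    /** Bytes allocated to the file but not used by its data. */
    static long wastedBytes(long fileBytes, long clusterBytes) {
        return clustersNeeded(fileBytes, clusterBytes) * clusterBytes - fileBytes;
    }

    public static void main(String[] args) {
        // Worst case from the text: file size modulo cluster size is one.
        System.out.println(clustersNeeded(4097, 4096)); // 2
        System.out.println(wastedBytes(4097, 4096));    // 4095
        // A file that is an exact multiple of the cluster size wastes nothing.
        System.out.println(wastedBytes(8192, 4096));    // 0
    }
}
```

The same file stored with 512-byte clusters would need nine clusters and waste 511 bytes, which illustrates why small clusters reduce wasted storage for small files.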

Every disk drive organization requires that some disk space be used to organize
the sectors, clusters, and so forth. The layout of sectors within a track is illustrated
by Figure 8.4. Typical information that must be stored on the disk itself includes
the File Allocation Table, sector headers that contain address marks and information
about the condition (whether usable or not) for each sector, and gaps between
sectors. The sector header also contains error detection codes to help verify that the
data have not been corrupted. This is why most disk drives have a “nominal” size
that is greater than the actual amount of user data that can be stored on the drive.
The difference is the amount of space required to organize the information on the
disk. Even more space will be lost due to fragmentation.
