only two-thirds of the time was required. The explanation, it developed, was that the
smaller collection was kept with its index on a single disk drive while, for the larger
collection, the index had been placed on a separate drive due to space constraints. In
the case of the smaller collection, the accesses to data and index had been interfering
with each other, causing disk access delays that were largely absent when two drives
were used. 7 Another aspect of research that this incident illustrates is the need for
inventive experiments. Identifying a range of ways to alter the behaviour of a system,
then measuring their effect, can lead to unexpected revelations.
Compilers are a substantial cause of systematic error. Different versions of the same compiler can generate code whose performance varies dramatically for particular programs, as can different versions of system software such as
file managers. After an upgrade from one version of Linux to the next release, the
time to run an external sort routine we were testing rose from 3,500 s to 7,500 s.
However, a code profiler showed that some individual routines were running more
quickly.
In another experiment, we were troubled by a random error. Sometimes a run
completed in around an hour, but often took an hour and a half. Intermediate times
were not observed. The experiment involved a complex interaction between stored
data, a memory buffer, and temporary files, so some variability was reasonable, but we
expected a spread of results—not two widely separated clusters. The explanation was
the screen lock; while earlier experiments had been run on a server, we had recently
moved to a high-performance desktop PC. The slower runs had been overnight, when the PC was not in use and the screen lock was active.
Test your intuition on the following example. Suppose you write a program for
searching a large file of randomly selected strings. The first stage of the program
reads the set of strings into memory, creates an array of pointers (one to the start of
each string), then sorts the pointers so that the strings are in lexicographic order. The
program then reads a query string and uses binary search in the array to find whether
the query is present in the original string set.
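A minimal sketch of such a program is given below, using C's standard qsort and bsearch; the input file name and the fixed size limits are assumptions for illustration only, not part of the original description.

```c
/* Sketch: read newline-separated strings from a file, build an array of
   pointers to them, sort the pointers lexicographically, then answer
   membership queries from standard input by binary search.
   "strings.txt" and the limits below are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_STRINGS 1000000
#define MAX_LINE    1024

/* qsort/bsearch pass pointers to the array elements, which are
   themselves pointers to the strings. */
static int cmp(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

int main(void)
{
    static char *strs[MAX_STRINGS];
    char line[MAX_LINE];
    size_t n = 0;

    FILE *fp = fopen("strings.txt", "r");   /* assumed input file */
    if (fp == NULL)
        return 1;

    /* Read the strings into memory, keeping a pointer to each. */
    while (n < MAX_STRINGS && fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\n")] = '\0';
        strs[n++] = strdup(line);
    }
    fclose(fp);

    /* Sort the pointers so that the strings are in lexicographic order. */
    qsort(strs, n, sizeof strs[0], cmp);

    /* Answer queries: report whether each query string is present. */
    while (fgets(line, sizeof line, stdin) != NULL) {
        line[strcspn(line, "\n")] = '\0';
        char *query = line;
        char **hit = bsearch(&query, strs, n, sizeof strs[0], cmp);
        printf("%s: %s\n", query, hit ? "found" : "not found");
    }
    return 0;   /* memory is not freed; acceptable for a one-shot test */
}
```

Note that only the pointers are moved during sorting; the strings themselves stay wherever they were placed when read, which is relevant to the questions below.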
1. If two searches in a row are for the same string, do both searches take about the
same length of time?
2. Suppose the number of strings in the file is increased by a factor of sixteen. Do
searches then take about four times as long?
3. Suppose linear search is used instead of binary search. If the original file is already
sorted, are searches the same speed as for an unsorted file? (Recall that the pointers
to the strings are sorted after the strings are read in.)
7 Problems of this kind, and their solutions, can be highly illuminating. In this case, we discovered
that disk seek times were a major component of total costs, accounting for around half of all elapsed
time. Had we been explicitly investigating the significance of seek costs, we might not have thought
of this experiment.
In an experiment undertaken in the mid-2000s, colleagues of mine found that the time
required for disk-based algorithms could vary by as much as 15 %, depending on whether the data
was stored near the inside or the outside of the disk platter. As they noted in their paper, to do such
experiments “you may need to become intimately acquainted with the behaviour of your disk drive”.