Statistical Principles - Writing for Computer Science

Information Technology Reference

In-Depth Information

Many programmers answer yes to these questions, but in each case the correct answer

is no, largely because of the impact of cache on running time. (1) The first search

loads the strings into cache. Memory access costs are a large component of total

time, so the second search is much faster. (2) Two factors affect search time as data

set size increases. One is that adjacent strings share longer prefixes, so the cost of

a string comparison grows, as well as the total number of comparisons. The other

is that, at some point, cache becomes ineffective because the volume of data means

that there are too many collisions, and memory access costs rise. (3) If the file is

sorted, the strings are likely to be sorted in memory, and will be prefetched during

the linear search. If the file is not sorted, each string comparison requires a random

memory access.

Intuition

Intuition is often unreliable in the context of statistics. Perceptual fallacies are well

known, such as the elementary mistake that, if the last few coin tosses were heads,

then the next is likely to be tails. Coincidences are more memorable than non-

coincidences; thus they seem more common than is in fact the case. A long ran-

dom sequence will have short subsequences that appear non-random. If a selected

subsequence has pattern, it is easy to jump to an incorrect conclusion.

A well-known example is the three-box problem. A contestant in a game is told

that one of X, Y, or Z contains a prize. The contestant chooses X but does not open it.

The host then opens Y and shows that it is empty. Should the contestant change to Z

or stay with X? Intuition says that it doesn't matter, as the probability of X containing

the prize is 1/2. Careful analysis of the alternatives shows that Z contains the prize

with probability 2/3, but when this problem was presented many mathematicians

publicly argued that 1/2 was correct.

In a variation of this error, suppose that person P has tossed two coins. Person Q

asks if one of them is heads, and P says yes. Then the intuitive estimate of the

probability that the other coin is heads is 1/2, on the basis that the status of one

coin is independent of the other, but again this is wrong. The correct response is 1/3.

The reason is that there are four possible configurations: heads-heads, heads-tails,

tails-heads, and tails-tails. Only tails-tails is eliminated by Q's question.

Intuition often seems to fail in the context of randomness and hashing. For exam-

ple, given a uniform random hash function, the likelihood of a given key hashing

to any particular one of 1,000 slots in a hash table is 1/1000. However, this does

not mean that 1,000 random keys will be allocated evenly amongst the slots; the

likelihood of this event is an infinitesimally low 1000

i

1000. Nor is it feasible for

all values to hash to the same slot; the probability of even twenty values hashing to

any one slot is absurdly small. On the other hand, statistical estimators such as the

Poisson distribution can make accurate predictions of values such as the number of

empty slots (around 368). Or an estimate can be formed with a simple simulation

program.

i

/

=

1

Writing for Computer Science

Search WWH ::

Custom Search

Home