Information Technology Reference
In-Depth Information
Many programmers answer yes to these questions, but in each case the correct answer
is no, largely because of the impact of cache on running time. (1) The first search
loads the strings into cache. Memory access costs are a large component of total
time, so the second search is much faster. (2) Two factors affect search time as data
set size increases. One is that adjacent strings share longer prefixes, so the cost of
a string comparison grows, as well as the total number of comparisons. The other
is that, at some point, cache becomes ineffective because the volume of data means
that there are too many collisions, and memory access costs rise. (3) If the file is
sorted, the strings are likely to be sorted in memory, and will be prefetched during
the linear search. If the file is not sorted, each string comparison requires a random
memory access.
Intuition
Intuition is often unreliable in the context of statistics. Perceptual fallacies are well
known, such as the elementary mistake that, if the last few coin tosses were heads,
then the next is likely to be tails. Coincidences are more memorable than non-
coincidences; thus they seem more common than is in fact the case. A long ran-
dom sequence will have short subsequences that appear non-random. If a selected
subsequence has pattern, it is easy to jump to an incorrect conclusion.
A well-known example is the three-box problem. A contestant in a game is told
that one of X, Y, or Z contains a prize. The contestant chooses X but does not open it.
The host then opens Y and shows that it is empty. Should the contestant change to Z
or stay with X? Intuition says that it doesn't matter, as the probability of X containing
the prize is 1/2. Careful analysis of the alternatives shows that Z contains the prize
with probability 2/3, but when this problem was presented many mathematicians
publicly argued that 1/2 was correct.
In a variation of this error, suppose that person P has tossed two coins. Person Q
asks if one of them is heads, and P says yes. Then the intuitive estimate of the
probability that the other coin is heads is 1/2, on the basis that the status of one
coin is independent of the other, but again this is wrong. The correct response is 1/3.
The reason is that there are four possible configurations: heads-heads, heads-tails,
tails-heads, and tails-tails. Only tails-tails is eliminated by Q's question.
Intuition often seems to fail in the context of randomness and hashing. For exam-
ple, given a uniform random hash function, the likelihood of a given key hashing
to any particular one of 1,000 slots in a hash table is 1/1000. However, this does
not mean that 1,000 random keys will be allocated evenly amongst the slots; the
likelihood of this event is an infinitesimally low 1000
i
1000. Nor is it feasible for
all values to hash to the same slot; the probability of even twenty values hashing to
any one slot is absurdly small. On the other hand, statistical estimators such as the
Poisson distribution can make accurate predictions of values such as the number of
empty slots (around 368). Or an estimate can be formed with a simple simulation
program.
i
/
=
1
 
Search WWH ::




Custom Search