correct and moved on to the next letter. If she guessed incorrectly, she was told the correct
answer and again moved on to the next letter. Here is a result from one of these experiments.
The dashes represent the letters that were correctly guessed.
Actual Text:         THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG
Subject Performance: ----ROO------NOT-V-----I------SM----OBL---
Notice that there is a good chance that the subject will guess the letter correctly, especially if the
letter is at the end of a word or if the word is clear from the context. If we now represent the
original sequence by the subject's performance, we get a very different set of probabilities
for the values that each element of the sequence takes on. The probabilities are much
more skewed in the second row: the "letter" - occurs with high probability. If a mathematical
twin of the subject were available at the other end, we could send the “reduced” sentence in
the second row and have the twin go through the same guessing process to come up with the
original sequence.
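
The following Python fragment is a rough sketch of this idea; the frequency-based single-guess predictor and the 27-symbol alphabet are simplified stand-ins (not part of Shannon's experiments) for the human subject and her knowledge of English.

# A rough sketch of the first experiment: letters the predictor guesses
# correctly are replaced by a dash, and a "mathematical twin" of the
# predictor at the receiver restores them by replaying the same guesses.
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # 26 letters plus the space

def best_guess(history):
    """Guess the most frequent symbol seen so far (ties broken alphabetically)."""
    counts = Counter(history)
    return max(ALPHABET, key=lambda s: (counts[s], -ALPHABET.index(s)))

def reduce_text(text):
    """Replace every correctly guessed letter with a dash."""
    history, out = "", []
    for ch in text:
        out.append("-" if best_guess(history) == ch else ch)
        history += ch
    return "".join(out)

def expand_text(reduced):
    """The twin replays the same predictions to fill the dashes back in."""
    history = ""
    for ch in reduced:
        history += best_guess(history) if ch == "-" else ch
    return history

text = "THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG"
assert expand_text(reduce_text(text)) == text
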
In the second experiment, the subject was allowed to continue guessing until she had
guessed the correct letter and the number of guesses required to correctly predict the letter
was noted. Again, the subject guessed correctly most of the time, so the most probable
number of guesses was 1. The existence of a mathematical twin at the receiving end would allow
this skewed sequence to represent the original sequence to the receiver. Shannon used his
experiments to come up with upper and lower bounds on the entropy of English (1.3 bits per
letter and 0.6 bits per letter, respectively).
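
The second experiment can be sketched in the same way; in the fragment below (again, the frequency-ranked predictor is only a hypothetical stand-in for the human subject), each letter is replaced by the number of guesses the shared predictor needs to reach it.

# A rough sketch of the second experiment: each letter is encoded as the
# number of guesses a shared deterministic predictor needs, and the twin
# at the receiver recovers the text by replaying the same guessing process.
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # 26 letters plus the space

def ranked_guesses(history):
    """Order the alphabet by how often each symbol has appeared so far."""
    counts = Counter(history)
    return sorted(ALPHABET, key=lambda s: (-counts[s], ALPHABET.index(s)))

def encode(text):
    """Replace each letter by the number of guesses needed to reach it."""
    history, guess_counts = "", []
    for ch in text:
        guess_counts.append(ranked_guesses(history).index(ch) + 1)  # 1 = first guess correct
        history += ch
    return guess_counts

def decode(guess_counts):
    """The twin replays the guessing process to recover the original text."""
    history = ""
    for n in guess_counts:
        history += ranked_guesses(history)[n - 1]
    return history

text = "THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG"
assert decode(encode(text)) == text

For a predictor as good as the human subject, the guess counts are overwhelmingly 1, which is what makes the resulting sequence so much easier to compress than the original text.
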
The difficulty with using these experiments is that the human subject was much better
at predicting the next letter in a sequence than any mathematical predictor we can develop.
Grammar is hypothesized to be innate to humans [73], in which case developing a predictor
for language as efficient as a human may not be possible in the near future. However, the
experiments do provide an approach to compression that is useful for all types of sequences,
not simply representations of language.
If a sequence of symbols being encoded does not consist of independent occurrences of the
symbols, then knowledge of which symbols have occurred in the neighborhood of the symbol
being encoded will give us a much better idea of the value of the symbol being encoded. If
we know the context in which a symbol occurs, we can guess what the value of the symbol is
with a much greater likelihood of success. This is just another way of saying that, given the
context, some symbols will occur with much higher, and others with much lower, probability
than they would without the context. That is, the probability distribution given the context is more skewed. If the
context is known to both encoder and decoder, we can use this skewed distribution to perform
the encoding, thus increasing the level of compression. The decoder can use its knowledge of
the context to determine the distribution to be used for decoding. If we can somehow group
like contexts together, it is quite likely that the symbols following these contexts will be the
same, allowing for the use of some very simple and efficient compression strategies. We can
see that context can play an important role in enhancing compression, and in this chapter we
will look at several different ways of using context.
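
As a small numerical illustration of this skewing (the sample string and the simple first-order estimate below are illustrative choices, not taken from the text), the following fragment compares the entropy of the unconditional letter distribution with the average entropy of the distributions conditioned on the previous letter; the conditional distributions are more skewed and therefore have lower entropy.

# Compare the order-0 (unconditional) letter entropy with the average
# order-1 (previous-letter context) entropy estimated from a sample string.
from collections import Counter, defaultdict
from math import log2

def entropy(counts):
    """Entropy in bits of the distribution given by a table of counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def order0_and_order1_entropy(text):
    order0 = Counter(text)
    contexts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        contexts[prev][cur] += 1
    # Average conditional entropy, weighting each context by how often it occurs.
    total = len(text) - 1
    h1 = sum(sum(c.values()) / total * entropy(c) for c in contexts.values())
    return entropy(order0), h1

sample = "the room was not very light a small oblong reading lamp on the desk"
h0, h1 = order0_and_order1_entropy(sample)
print(f"order-0 entropy: {h0:.2f} bits/letter")
print(f"order-1 entropy: {h1:.2f} bits/letter (lower, because the context skews the distribution)")
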
Consider the encoding of the word probability. Suppose we have already encoded the
first four letters, and we want to code the fifth letter, a. If we ignore the first four letters, the
probability of the letter a is about 0.06. If we use the information that the previous letter is b,
 