correct and moved on to the next letter. If she guessed incorrectly, she was told the correct
answer and again moved on to the next letter. Here is a result from one of these experiments.
The dashes represent the letters that were correctly guessed.
Actual Text:         THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG
Subject Performance: ----ROO------NOT-V-----I------SM----OBL---
Notice that there is a good chance that the subject will guess the letter correctly, especially if the
letter is at the end of a word or if the word is clear from the context. If we now represent the
original sequence by the subject's performance, we get a very different set of probabilities
for the values that each element of the sequence takes on. The probabilities are much
more skewed in the second row: the "letter" - occurs with high probability. If a mathematical
twin of the subject were available at the other end, we could send the “reduced” sentence in
the second row and have the twin go through the same guessing process to come up with the
original sequence.
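
The following Python fragment is a rough sketch of this idea; the frequency-based single-guess predictor and the 27-symbol alphabet are simplified stand-ins (not part of Shannon's experiments) for the human subject and her knowledge of English.

# A rough sketch of the first experiment: letters the predictor guesses
# correctly are replaced by a dash, and a "mathematical twin" of the
# predictor at the receiver restores them by replaying the same guesses.
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # 26 letters plus the space

def best_guess(history):
    """Guess the most frequent symbol seen so far (ties broken alphabetically)."""
    counts = Counter(history)
    return max(ALPHABET, key=lambda s: (counts[s], -ALPHABET.index(s)))

def reduce_text(text):
    """Replace every correctly guessed letter with a dash."""
    history, out = "", []
    for ch in text:
        out.append("-" if best_guess(history) == ch else ch)
        history += ch
    return "".join(out)

def expand_text(reduced):
    """The twin replays the same predictions to fill the dashes back in."""
    history = ""
    for ch in reduced:
        history += best_guess(history) if ch == "-" else ch
    return history

text = "THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG"
assert expand_text(reduce_text(text)) == text
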
In the second experiment, the subject was allowed to continue guessing until she had
guessed the correct letter and the number of guesses required to correctly predict the letter
was noted. Again, the subject guessed correctly most of the time, so the most probable
number of guesses was 1. The existence of a mathematical twin at the receiving end would allow
this skewed sequence to represent the original sequence to the receiver. Shannon used his
experiments to come up with upper and lower bounds on the entropy of English (1.3 bits per
letter and 0.6 bits per letter, respectively).
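
The second experiment can be sketched in the same way; in the fragment below (again, the frequency-ranked predictor is only a hypothetical stand-in for the human subject), each letter is replaced by the number of guesses the shared predictor needs to reach it.

# A rough sketch of the second experiment: each letter is encoded as the
# number of guesses a shared deterministic predictor needs, and the twin
# at the receiver recovers the text by replaying the same guessing process.
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # 26 letters plus the space

def ranked_guesses(history):
    """Order the alphabet by how often each symbol has appeared so far."""
    counts = Counter(history)
    return sorted(ALPHABET, key=lambda s: (-counts[s], ALPHABET.index(s)))

def encode(text):
    """Replace each letter by the number of guesses needed to reach it."""
    history, guess_counts = "", []
    for ch in text:
        guess_counts.append(ranked_guesses(history).index(ch) + 1)  # 1 = first guess correct
        history += ch
    return guess_counts

def decode(guess_counts):
    """The twin replays the guessing process to recover the original text."""
    history = ""
    for n in guess_counts:
        history += ranked_guesses(history)[n - 1]
    return history

text = "THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG"
assert decode(encode(text)) == text

For a predictor as good as the human subject, the guess counts are overwhelmingly 1, which is what makes the resulting sequence so much easier to compress than the original text.
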
The difficulty with using these experiments is that the human subject was much better
at predicting the next letter in a sequence than any mathematical predictor we can develop.
Grammar is hypothesized to be innate to humans [73], in which case developing a predictor
for language as efficient as a human may not be possible in the near future. However, the
experiments do provide an approach to compression that is useful for all types of sequences,
not simply representations of language.
If a sequence of symbols being encoded does not consist of independent occurrences of the
symbols, then knowledge of which symbols have occurred in the neighborhood of the symbol
being encoded will give us a much better idea of the value of the symbol being encoded. If
we know the context in which a symbol occurs, we can guess what the value of the symbol is
with a much greater likelihood of success. This is just another way of saying that, given the
context, some symbols will occur with much higher, and others with much lower, probability
than they would without the context. That is, the probability distribution given the context is more skewed. If the
context is known to both encoder and decoder, we can use this skewed distribution to perform
the encoding, thus increasing the level of compression. The decoder can use its knowledge of
the context to determine the distribution to be used for decoding. If we can somehow group
like contexts together, it is quite likely that the symbols following these contexts will be the
same, allowing for the use of some very simple and efficient compression strategies. We can
see that context can play an important role in enhancing compression, and in this chapter we
will look at several different ways of using context.
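
As a small numerical illustration of this skewing (the sample string and the simple first-order estimate below are illustrative choices, not taken from the text), the following fragment compares the entropy of the unconditional letter distribution with the average entropy of the distributions conditioned on the previous letter; the conditional distributions are more skewed and therefore have lower entropy.

# Compare the order-0 (unconditional) letter entropy with the average
# order-1 (previous-letter context) entropy estimated from a sample string.
from collections import Counter, defaultdict
from math import log2

def entropy(counts):
    """Entropy in bits of the distribution given by a table of counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def order0_and_order1_entropy(text):
    order0 = Counter(text)
    contexts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        contexts[prev][cur] += 1
    # Average conditional entropy, weighting each context by how often it occurs.
    total = len(text) - 1
    h1 = sum(sum(c.values()) / total * entropy(c) for c in contexts.values())
    return entropy(order0), h1

sample = "the room was not very light a small oblong reading lamp on the desk"
h0, h1 = order0_and_order1_entropy(sample)
print(f"order-0 entropy: {h0:.2f} bits/letter")
print(f"order-1 entropy: {h1:.2f} bits/letter (lower, because the context skews the distribution)")
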
Consider the encoding of the word probability. Suppose we have already encoded the
first four letters, and we want to code the fifth letter, a. If we ignore the first four letters, the
probability of the letter a is about 0.06. If we use the information that the previous letter is b,
 