probability of g would increase substantially. As we increase the length of the context (going from n to in to din and so on), the probability distribution over the alphabet becomes more and more skewed, which results in lower entropy.
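To see in concrete terms why a more skewed distribution means lower entropy, consider the following short Python sketch. The two distributions are invented purely for illustration (they are not measurements of English text): one is uniform over a four-letter alphabet, the other is the kind of heavily skewed distribution a long context can produce.

    from math import log2

    def entropy(probs):
        # First-order entropy in bits/symbol of a probability distribution.
        return -sum(p * log2(p) for p in probs if p > 0)

    # Hypothetical distributions over a four-letter alphabet.
    no_context   = [0.25, 0.25, 0.25, 0.25]   # nothing known about the context
    long_context = [0.90, 0.05, 0.03, 0.02]   # context makes one letter very likely

    print(entropy(no_context))     # 2.00 bits/symbol
    print(entropy(long_context))   # roughly 0.62 bits/symbol

The more the context concentrates probability on a few symbols, the fewer bits per symbol are needed on average.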
Shannon used a second-order model for English text consisting of the 26 letters and one
space to obtain an entropy of 3.1 bits/letter [4]. Using a model where the output symbols
were words rather than letters brought down the entropy to 2.4 bits/letter. Shannon then used
predictions generated by people (rather than statistical models) to estimate the upper and lower
bounds on the entropy of English text. For the case where the subjects knew the
100 previous letters, he estimated these bounds to be 1.3 and 0.6 bits/letter, respectively.
The longer the context, the better its predictive value. However, if we were to store the
probability model with respect to all contexts of a given length, the number of contexts would
grow exponentially with the length of context. Furthermore, given that the source imposes
some structure on its output, many of these contexts may correspond to strings that would
never occur in practice. Consider a context model of order four (the context is determined by
the last four symbols). If we take an alphabet size of 95, the possible number of contexts is
95⁴, more than 81 million!
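The arithmetic is easy to verify. The short loop below (the alphabet size of 95 is the one used above; the other orders are shown only for comparison) prints how quickly the number of contexts grows with the order of the model.

    alphabet_size = 95                      # e.g., the printable ASCII characters
    for order in range(1, 6):
        print(order, alphabet_size ** order)
    # order 4 gives 95**4 = 81,450,625 contexts, more than 81 million

Keeping even a single count per symbol for every one of these contexts quickly becomes impractical, which is part of the motivation for the adaptive strategy described next.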
This problem is further exacerbated by the fact that different realizations of the source
output may vary considerably in terms of repeating patterns. Therefore, context modeling
in text compression schemes tends to be an adaptive strategy in which the probabilities for
different symbols in the different contexts are updated as the symbols are encountered. However, this
means that we will often encounter symbols that have not been seen before in the given
context (this is known as the zero frequency problem). The larger the context, the
more often this will happen. This problem could be resolved by sending a code to indicate
that the following symbol was being encountered for the first time, followed by a prearranged
code for that symbol. This would significantly increase the length of the code for the symbol
on its first occurrence (in the given context). However, if this situation did not occur too often,
the overhead associated with such occurrences would be small compared to the total number
of bits used to encode the output of the source. Unfortunately, in context-based encoding, the
zero frequency problem is encountered often enough for overhead to be a problem, especially
for longer contexts. Solutions to this problem are presented by the ppm (prediction with partial
match) algorithm and its variants (described in detail in Chapter 6).
Briefly, the ppm algorithm first attempts to find if the symbol to be encoded has a nonzero
probability with respect to the maximum context length. If this is so, the symbol is encoded
and transmitted. If not, an escape symbol is transmitted, the context size is reduced by one,
and the process is repeated. This procedure is repeated until a context is found with respect to
which the symbol has a nonzero probability. To guarantee that this process converges, a null
context is always included with respect to which all symbols have equal probability. Initially,
only the shorter contexts are likely to be used. However, as more and more of the source output
is processed, the longer contexts, which offer better prediction, will be used more often. The
probability of the escape symbol can be computed in a number of different ways, leading to
different implementations [1].
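The following Python sketch is our own much-simplified illustration of this fallback procedure, not the full ppm coder of Chapter 6. It keeps adaptive counts for every context it has seen, emits an escape event whenever the symbol has a zero count in the current context, shortens the context one symbol at a time, and finally falls back on a null context in which all 95 symbols are equally likely. How the escape events are assigned probabilities and actually coded is precisely the part that differs between ppm variants and is left out here.

    from collections import defaultdict, Counter

    MAX_ORDER = 4
    ALPHABET = [chr(c) for c in range(32, 127)]   # 95 printable characters
    counts = defaultdict(Counter)                 # counts[context][symbol] -> occurrences

    def encode_symbol(history, symbol):
        # Return the list of coding events (zero or more escapes followed by one
        # symbol event) produced for `symbol` given the previously seen text `history`.
        events = []
        for order in range(min(MAX_ORDER, len(history)), -1, -1):
            context = history[len(history) - order:]
            ctx_counts = counts[context]
            if ctx_counts[symbol] > 0:
                total = sum(ctx_counts.values())
                events.append(("symbol", order, ctx_counts[symbol] / total))
                break
            events.append(("escape", order))      # zero frequency in this context
        else:
            # Null context: every symbol of the alphabet has equal probability,
            # so the search is guaranteed to terminate with a codable event.
            events.append(("symbol", -1, 1 / len(ALPHABET)))
        # Update the counts of every context that was consulted so the model adapts.
        for order in range(min(MAX_ORDER, len(history)) + 1):
            counts[history[len(history) - order:]][symbol] += 1
        return events

    history = ""
    for ch in "this is the thesis":
        print(repr(ch), encode_symbol(history, ch))
        history += ch

Running the sketch on a short string shows the behavior described above: early on, almost every symbol produces a string of escapes down to the null context, while later symbols are increasingly coded from the longer contexts.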
The use of Markov models in text compression is a rich and active area of research. We
describe some of these approaches in Chapter 6 (for more details, see [1]).