probability of g would increase substantially. As we increase the length of the context (going from n to in to din and so on), the probability distribution over the alphabet becomes more and more skewed, which results in lower entropy.
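To see in concrete terms why a more skewed distribution means lower entropy, consider the following short Python sketch. The two distributions are invented purely for illustration (they are not measurements of English text): one is uniform over a four-letter alphabet, the other is the kind of heavily skewed distribution a long context can produce.

    from math import log2

    def entropy(probs):
        # First-order entropy in bits/symbol of a probability distribution.
        return -sum(p * log2(p) for p in probs if p > 0)

    # Hypothetical distributions over a four-letter alphabet.
    no_context   = [0.25, 0.25, 0.25, 0.25]   # nothing known about the context
    long_context = [0.90, 0.05, 0.03, 0.02]   # context makes one letter very likely

    print(entropy(no_context))     # 2.00 bits/symbol
    print(entropy(long_context))   # roughly 0.62 bits/symbol

The more the context concentrates probability on a few symbols, the fewer bits per symbol are needed on average.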
Shannon used a second-order model for English text consisting of the 26 letters and one
space to obtain an entropy of 3.1 bits/letter [4]. Using a model where the output symbols
were words rather than letters brought down the entropy to 2.4 bits/letter. Shannon then used
predictions generated by people (rather than statistical models) to estimate the upper and lower
bounds on the entropy of English text. For the case where the subjects knew the
100 previous letters, he estimated these bounds to be 1.3 and 0.6 bits/letter, respectively.
The longer the context, the better its predictive value. However, if we were to store the
probability model with respect to all contexts of a given length, the number of contexts would
grow exponentially with the length of context. Furthermore, given that the source imposes
some structure on its output, many of these contexts may correspond to strings that would
never occur in practice. Consider a context model of order four (the context is determined by
the last four symbols). If we take an alphabet size of 95, the possible number of contexts is
95⁴, more than 81 million!
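The arithmetic is easy to verify. The short loop below (the alphabet size of 95 is the one used above; the other orders are shown only for comparison) prints how quickly the number of contexts grows with the order of the model.

    alphabet_size = 95                      # e.g., the printable ASCII characters
    for order in range(1, 6):
        print(order, alphabet_size ** order)
    # order 4 gives 95**4 = 81,450,625 contexts, more than 81 million

Keeping even a single count per symbol for every one of these contexts quickly becomes impractical, which is part of the motivation for the adaptive strategy described next.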
This problem is further exacerbated by the fact that different realizations of the source
output may vary considerably in terms of repeating patterns. Therefore, context modeling
in text compression schemes tends to be an adaptive strategy in which the probabilities for
different symbols in the different contexts are updated as the symbols are encountered. However, this
means that we will often encounter symbols that have not been seen before in the given
context (this is known as the zero frequency problem). The larger the context, the
more often this will happen. This problem could be resolved by sending a code to indicate
that the following symbol was being encountered for the first time, followed by a prearranged
code for that symbol. This would significantly increase the length of the code for the symbol
on its first occurrence (in the given context). However, if this situation did not occur too often,
the overhead associated with such occurrences would be small compared to the total number
of bits used to encode the output of the source. Unfortunately, in context-based encoding, the
zero frequency problem is encountered often enough for overhead to be a problem, especially
for longer contexts. Solutions to this problem are presented by the ppm (prediction with partial
match) algorithm and its variants (described in detail in Chapter 6).
Briefly, the ppm algorithm first attempts to find if the symbol to be encoded has a nonzero
probability with respect to the maximum context length. If this is so, the symbol is encoded
and transmitted. If not, an escape symbol is transmitted, the context size is reduced by one,
and the process is repeated. This procedure is repeated until a context is found with respect to
which the symbol has a nonzero probability. To guarantee that this process converges, a null
context is always included with respect to which all symbols have equal probability. Initially,
only the shorter contexts are likely to be used. However, as more and more of the source output
is processed, the longer contexts, which offer better prediction, will be used more often. The
probability of the escape symbol can be computed in a number of different ways, leading to
different implementations [1].
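The following Python sketch is our own much-simplified illustration of this fallback procedure, not the full ppm coder of Chapter 6. It keeps adaptive counts for every context it has seen, emits an escape event whenever the symbol has a zero count in the current context, shortens the context one symbol at a time, and finally falls back on a null context in which all 95 symbols are equally likely. How the escape events are assigned probabilities and actually coded is precisely the part that differs between ppm variants and is left out here.

    from collections import defaultdict, Counter

    MAX_ORDER = 4
    ALPHABET = [chr(c) for c in range(32, 127)]   # 95 printable characters
    counts = defaultdict(Counter)                 # counts[context][symbol] -> occurrences

    def encode_symbol(history, symbol):
        # Return the list of coding events (zero or more escapes followed by one
        # symbol event) produced for `symbol` given the previously seen text `history`.
        events = []
        for order in range(min(MAX_ORDER, len(history)), -1, -1):
            context = history[len(history) - order:]
            ctx_counts = counts[context]
            if ctx_counts[symbol] > 0:
                total = sum(ctx_counts.values())
                events.append(("symbol", order, ctx_counts[symbol] / total))
                break
            events.append(("escape", order))      # zero frequency in this context
        else:
            # Null context: every symbol of the alphabet has equal probability,
            # so the search is guaranteed to terminate with a codable event.
            events.append(("symbol", -1, 1 / len(ALPHABET)))
        # Update the counts of every context that was consulted so the model adapts.
        for order in range(min(MAX_ORDER, len(history)) + 1):
            counts[history[len(history) - order:]][symbol] += 1
        return events

    history = ""
    for ch in "this is the thesis":
        print(repr(ch), encode_symbol(history, ch))
        history += ch

Running the sketch on a short string shows the behavior described above: early on, almost every symbol produces a string of escapes down to the null context, while later symbols are increasingly coded from the longer contexts.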
The use of Markov models in text compression is a rich and active area of research. We
describe some of these approaches in Chapter 6 (for more details, see [1]).