this reduces the probability of several letters, such as q and z, occurring and boosts the probability of an a occurring. In this example, b would be the first-order context for a, ob would be the second-order context for a, and so on. Using more letters to define the context in which a occurs, or higher-order contexts, will generally increase the probability of the occurrence of a in this example and, hence, reduce the number of bits required to encode its occurrence. Therefore, what we would like to do is to encode each letter using the probability of its occurrence with respect to a context of high order.
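The effect is easy to quantify: an ideal coder spends about -log2(p) bits on an event of probability p, so a sharper conditional probability translates directly into fewer bits. The short Python sketch below illustrates this; the probability values are hypothetical, chosen only to show the trend of higher-order contexts boosting the probability of a.

import math

# Hypothetical probabilities for the letter a: unconditioned, given the
# first-order context b, and given the second-order context ob.
examples = [("no context", 0.06),
            ("context b", 0.20),
            ("context ob", 0.55)]

for label, p in examples:
    bits = -math.log2(p)   # ideal code length for an event of probability p
    print(f"{label:12s} p = {p:.2f} -> {bits:.2f} bits")

With these made-up numbers the cost drops from about 4.1 bits with no context to about 0.9 bits with the second-order context ob.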
If we want to have probabilities with respect to all possible high-order contexts, this might
be an overwhelming amount of information. Consider an alphabet of size M. The number of first-order contexts is M, the number of second-order contexts is M², and so on. Therefore, if we wanted to encode a sequence from an alphabet of size 256 using contexts of order 5, we would need 256⁵, or about 1.09951 × 10¹², probability distributions! This is not a practical alternative. A set of algorithms that resolve this problem in a very simple and elegant way is based on the prediction with partial match approach. We will describe this approach in the next section.
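The growth quoted above is easy to verify; a couple of lines of Python reproduce the count of contexts for each order.

# Number of distinct contexts of each order for an alphabet of size M = 256.
M = 256
for order in range(1, 6):
    print(f"order {order}: {M**order:,} contexts")
# order 5 gives 1,099,511,627,776 contexts, i.e. about 1.0995 x 10^12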
6.3 Prediction with Partial Match (ppm)
The best-known context-based algorithm is the ppm algorithm, first proposed by Cleary and
Witten [ 74 ] in 1984. It has not been as popular as the various Ziv-Lempel-based algorithms
mainly because of the faster execution speeds of the latter algorithms. Lately, with the devel-
opment of more efficient variants, ppm -based algorithms are becoming increasingly popular.
The idea of the ppm algorithm is elegantly simple. We would like to use large contexts
to determine the probability of the symbol being encoded. However, the use of large contexts
would require us to estimate and store an extremely large number of conditional probabilities,
which might not be feasible. Instead of estimating these probabilities ahead of time, we can
reduce the burden by estimating the probabilities as the coding proceeds. This way we only
need to store those contexts that have occurred in the sequence being encoded. This is a much
smaller number than the number of all possible contexts. While this mitigates the problem of
storage, it also means that, especially at the beginning of an encoding, we will need to code
letters that have not occurred previously in this context. In order to handle this situation, the
source coder alphabet always contains an escape symbol, which is used to signal that the letter
to be encoded has not been seen in this context.
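A minimal Python sketch of the bookkeeping this implies is given below. It stores counts only for contexts that have actually occurred and reports when an escape would be needed; the class and method names are illustrative, and the arithmetic coding step, as well as how ppm assigns a probability to the escape symbol itself, are not shown.

from collections import defaultdict, Counter

class ContextCounts:
    def __init__(self):
        # Context string -> counts of the symbols that have followed it.
        # Entries are created only as contexts actually occur in the data.
        self.counts = defaultdict(Counter)

    def probability(self, context, symbol):
        # Return an estimate if the symbol has been seen in this context,
        # otherwise None, signalling that an escape would be coded.
        seen = self.counts[context]
        if symbol in seen:
            return seen[symbol] / sum(seen.values())
        return None

    def update(self, context, symbol):
        self.counts[context][symbol] += 1

For instance, after model.update("ob", "a") the call model.probability("ob", "a") returns 1.0, while model.probability("ob", "x") returns None, meaning an escape symbol would be sent.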
6.3.1 The Basic Algorithm
The basic algorithm initially attempts to use the largest context. The size of the largest context
is predetermined. If the symbol to be encoded has not previously been encountered in this
context, an escape symbol is encoded and the algorithm attempts to use the next smaller context.
If the symbol has not occurred in this context either, the size of the context is further reduced.
This process continues until either we obtain a context that has previously been encountered
with this symbol or we arrive at the conclusion that the symbol has not been encountered
previously in any context.
In this case, we use a probability of 1/M to encode the symbol, where M is the size of the source alphabet.
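The context-shortening loop just described can be sketched as follows, assuming a hypothetical counts table of the kind built in the previous sketch (a dictionary mapping each previously seen context to a Counter of the symbols that followed it). The function name and arguments are illustrative; escapes are only noted in comments rather than coded, and the probabilities returned ignore the mass that ppm must reserve for the escape symbol.

def choose_context(history, symbol, counts, max_order, alphabet_size):
    # Try the longest available context first, shortening after each escape.
    for order in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - order:]   # last `order` symbols; "" at order 0
        seen = counts.get(context)
        if seen and symbol in seen:
            # The symbol has occurred in this context before.
            return context, seen[symbol] / sum(seen.values())
        # Otherwise an escape would be encoded here and the context shortened.
    # The symbol has never been encountered in any context: fall back to 1/M.
    return None, 1.0 / alphabet_size

For example, with counts = {"ob": Counter({"a": 3})} (using collections.Counter), choose_context("prob", "a", counts, max_order=2, alphabet_size=256) returns ("ob", 1.0), while a symbol unseen in every context falls through to (None, 1/256).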