Databases Reference
In-Depth Information
same symbol are called deterministic contexts. The ppm algorithm first looks for the longest
deterministic context. If the symbol to be encoded does not occur in that context, an escape
symbol is encoded and the algorithm defaults to the maximum context length. This approach
seems to provide a small but significant amount of improvement over the basic algorithm.
Currently, the best variant of the ppm algorithm is the ppmz algorithm by Charles Bloom.
Details of the ppmz algorithm as well as implementations of the algorithm can be found at
http://www.cbloom.com/src/ppmz.html .
6.3.4 The Exclusion Principle
The basic idea behind arithmetic coding is the division of the unit interval into subintervals,
each of which represents a particular letter. The smaller the subinterval, the more bits are
required to distinguish it from other subintervals. If we can reduce the number of symbols to
be represented, the number of subintervals goes down as well. This in turn means that the sizes
of the subintervals increase, leading to a reduction in the number of bits required for encoding.
The exclusion principle used in ppm provides this kind of reduction in rate. Suppose we have
been compressing a text sequence and come upon the sequence proba , and suppose we are
trying to encode the letter a . Suppose also that the state of the two-letter context ob and the
one-letter context b are as shown in Table 6.12 .
First, we attempt to encode a with the two-letter context. As a does not occur in this
context, we issue an escape symbol and reduce the size of the context. Looking at the table
for the one-letter context b , we see that a does occur in this context with a count of 4 out of a
total possible count of 21. Notice that other letters in this context include l and o . However,
by sending the escape symbol in the context of ob , we have already signalled to the decoder
that the symbol being encoded is not any of the letters that have previously been encountered
in the context of ob . Therefore, we can increase the size of the subinterval corresponding
to a by temporarily removing l and o from the table. Instead of using Table 6.12 ,weuse
Table 6.13 to encode a . This exclusion of symbols from contexts on a temporary basis can
result in cumulatively significant savings in terms of rate.
T A B L E 6 . 12
Counts for exclusion example.
Context
Symbol
Count
ob
l
10
o
03
Esc
02
Total_Count
15
b
l
05
o
03
a
04
r
02
e
02
Esc
05
Total_Count
21
 
Search WWH ::




Custom Search