this reduces the probability of several letters, such as q and z, occurring and boosts the probability of an a occurring. In this example, b would be the first-order context for a, ob would be the second-order context for a, and so on. Using more letters to define the context in which a occurs, or higher-order contexts, will generally increase the probability of the occurrence of a in this example and, hence, reduce the number of bits required to encode its occurrence. Therefore, what we would like to do is to encode each letter using the probability of its occurrence with respect to a context of high order.
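The effect is easy to quantify: an ideal coder spends about -log2(p) bits on an event of probability p, so a sharper conditional probability translates directly into fewer bits. The short Python sketch below illustrates this; the probability values are hypothetical, chosen only to show the trend of higher-order contexts boosting the probability of a.

import math

# Hypothetical probabilities for the letter a: unconditioned, given the
# first-order context b, and given the second-order context ob.
examples = [("no context", 0.06),
            ("context b", 0.20),
            ("context ob", 0.55)]

for label, p in examples:
    bits = -math.log2(p)   # ideal code length for an event of probability p
    print(f"{label:12s} p = {p:.2f} -> {bits:.2f} bits")

With these made-up numbers the cost drops from about 4.1 bits with no context to about 0.9 bits with the second-order context ob.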
If we want to have probabilities with respect to all possible high-order contexts, this might
be an overwhelming amount of information. Consider an alphabet of size M. The number of first-order contexts is M, the number of second-order contexts is M², and so on. Therefore, if we wanted to encode a sequence from an alphabet of size 256 using contexts of order 5, we would need 256⁵, or about 1.09951 × 10¹², probability distributions! This is not a practical alternative. A set of algorithms that resolve this problem in a very simple and elegant way is based on the prediction with partial match approach. We will describe this approach in the next section.
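The growth quoted above is easy to verify; a couple of lines of Python reproduce the count of contexts for each order.

# Number of distinct contexts of each order for an alphabet of size M = 256.
M = 256
for order in range(1, 6):
    print(f"order {order}: {M**order:,} contexts")
# order 5 gives 1,099,511,627,776 contexts, i.e. about 1.0995 x 10^12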
6.3 Prediction with Partial Match (ppm)
The best-known context-based algorithm is the ppm algorithm, first proposed by Cleary and
Witten [ 74 ] in 1984. It has not been as popular as the various Ziv-Lempel-based algorithms
mainly because of the faster execution speeds of the latter algorithms. Lately, with the devel-
opment of more efficient variants, ppm -based algorithms are becoming increasingly popular.
The idea of the ppm algorithm is elegantly simple. We would like to use large contexts
to determine the probability of the symbol being encoded. However, the use of large contexts
would require us to estimate and store an extremely large number of conditional probabilities,
which might not be feasible. Instead of estimating these probabilities ahead of time, we can
reduce the burden by estimating the probabilities as the coding proceeds. This way we only
need to store those contexts that have occurred in the sequence being encoded. This is a much
smaller number than the number of all possible contexts. While this mitigates the problem of
storage, it also means that, especially at the beginning of an encoding, we will need to code
letters that have not occurred previously in this context. In order to handle this situation, the
source coder alphabet always contains an escape symbol, which is used to signal that the letter
to be encoded has not been seen in this context.
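A minimal Python sketch of the bookkeeping this implies is given below. It stores counts only for contexts that have actually occurred and reports when an escape would be needed; the class and method names are illustrative, and the arithmetic coding step, as well as how ppm assigns a probability to the escape symbol itself, are not shown.

from collections import defaultdict, Counter

class ContextCounts:
    def __init__(self):
        # Context string -> counts of the symbols that have followed it.
        # Entries are created only as contexts actually occur in the data.
        self.counts = defaultdict(Counter)

    def probability(self, context, symbol):
        # Return an estimate if the symbol has been seen in this context,
        # otherwise None, signalling that an escape would be coded.
        seen = self.counts[context]
        if symbol in seen:
            return seen[symbol] / sum(seen.values())
        return None

    def update(self, context, symbol):
        self.counts[context][symbol] += 1

For instance, after model.update("ob", "a") the call model.probability("ob", "a") returns 1.0, while model.probability("ob", "x") returns None, meaning an escape symbol would be sent.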
6.3.1 The Basic Algorithm
The basic algorithm initially attempts to use the largest context. The size of the largest context
is predetermined. If the symbol to be encoded has not previously been encountered in this
context, an escape symbol is encoded and the algorithm attempts to use the next smaller context.
If the symbol has not occurred in this context either, the size of the context is further reduced.
This process continues until either we obtain a context that has previously been encountered
with this symbol or we arrive at the conclusion that the symbol has not been encountered
previously in any context.
In this case, we use a probability of 1/M to encode the symbol, where M is the size of the source alphabet.
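The context-shortening loop just described can be sketched as follows, assuming a hypothetical counts table of the kind built in the previous sketch (a dictionary mapping each previously seen context to a Counter of the symbols that followed it). The function name and arguments are illustrative; escapes are only noted in comments rather than coded, and the probabilities returned ignore the mass that ppm must reserve for the escape symbol.

def choose_context(history, symbol, counts, max_order, alphabet_size):
    # Try the longest available context first, shortening after each escape.
    for order in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - order:]   # last `order` symbols; "" at order 0
        seen = counts.get(context)
        if seen and symbol in seen:
            # The symbol has occurred in this context before.
            return context, seen[symbol] / sum(seen.values())
        # Otherwise an escape would be encoded here and the context shortened.
    # The symbol has never been encountered in any context: fall back to 1/M.
    return None, 1.0 / alphabet_size

For example, with counts = {"ob": Counter({"a": 3})} (using collections.Counter), choose_context("prob", "a", counts, max_order=2, alphabet_size=256) returns ("ob", 1.0), while a symbol unseen in every context falls through to (None, 1/256).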