encoded with a reference to the dictionary. If the pattern does not appear in the dictionary,
then it can be encoded using some other less efficient method. In effect, we are splitting the
input into two classes: frequently occurring patterns and infrequently occurring patterns. For
this technique to be effective, the class of frequently occurring patterns, and hence the size of
the dictionary, must be much smaller than the number of all possible patterns.
Suppose we have a particular text that consists of four-character words, three characters
from the 26 lowercase letters of the English alphabet followed by a punctuation mark. Sup-
pose our source alphabet consists of the 26 lowercase letters of the English alphabet and the
punctuation marks comma, period, exclamation mark, question mark, semicolon, and colon.
In other words, the size of the input alphabet is 32. If we were to encode the text source one
character at a time, treating each character as an equally likely event, we would need 5 bits per
character. Treating all 32 4
2 20
four-character patterns as equally likely, we
have a code that assigns 20 bits to each four-character pattern. Let us now put the 256 most
likely four-character patterns into a dictionary. The transmission scheme works as follows.
Whenever we want to send a pattern that exists in the dictionary, we will send a 1-bit flag, say,
a 0, followed by an 8-bit index corresponding to the entry in the dictionary. If the pattern is
not in the dictionary, we will send a 1 followed by the 20-bit encoding of the pattern. If the
pattern we encounter is not in the dictionary, we will actually use more bits than in the original
scheme, 21 instead of 20. But if it is in the dictionary, we will send only 9 bits. The utility of
our scheme will depend on the percentage of the words we encounter that are in the dictionary.
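The flag-plus-index scheme described above can be sketched in a few lines of code. This is a minimal illustration, not an implementation from the text: the 32-symbol alphabet ordering, the 5-bit character code, and the sample dictionary entries are all assumptions made for the example.

```python
# Sketch of the flag-plus-index dictionary scheme described above.
# The alphabet ordering and the example dictionary are illustrative assumptions.

ALPHABET = "abcdefghijklmnopqrstuvwxyz,.!?;:"  # 32 symbols -> 5 bits each
CODE = {ch: format(i, "05b") for i, ch in enumerate(ALPHABET)}

# A hypothetical dictionary holding up to 256 frequent four-character patterns.
dictionary = ["the.", "and,", "for.", "you?"]          # ... up to 256 entries
index_of = {pat: i for i, pat in enumerate(dictionary)}

def encode(pattern):
    """Encode one four-character pattern as a bit string."""
    if pattern in index_of:
        # In the dictionary: 1-bit flag '0' + 8-bit index = 9 bits.
        return "0" + format(index_of[pattern], "08b")
    # Not in the dictionary: flag '1' + four 5-bit characters = 21 bits.
    return "1" + "".join(CODE[ch] for ch in pattern)

print(len(encode("the.")))  # 9 bits for a dictionary pattern
print(len(encode("cat!")))  # 21 bits for a non-dictionary pattern
```

Note that a dictionary hit costs 9 bits while a miss costs 21, which is exactly the trade-off the next paragraph quantifies.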
We can get an idea about the utility of our scheme by calculating the average number of bits per
pattern. If the probability of encountering a pattern from the dictionary is p, then the average
number of bits per pattern R is given by

R = 9p + 21(1 − p) = 21 − 12p        (5.1)
For our scheme to be useful, R should have a value less than 20. This happens when p > 0.084.
This does not seem like a very large number. However, note that if all patterns were occurring
in an equally likely manner, the probability of encountering a pattern from the dictionary would
be less than 0.00025!
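The arithmetic behind these two numbers can be checked directly. The sketch below is an illustrative calculation, not code from the text; it evaluates Equation (5.1) and the equally-likely hit probability for a 256-entry dictionary.

```python
# Numerically checking Equation (5.1): R = 9p + 21(1 - p) = 21 - 12p.

def average_bits(p):
    """Average bits per pattern when a fraction p of patterns hit the dictionary."""
    return 9 * p + 21 * (1 - p)

# Break-even: R < 20 requires p > 1/12, i.e. roughly 0.084.
print(round(1 / 12, 3))          # 0.083
print(average_bits(0.084))       # just under 20 bits per pattern

# If all 32**4 patterns were equally likely, the chance of hitting
# a 256-entry dictionary would be only:
print(256 / 32**4)               # about 0.000244, i.e. less than 0.00025
```

The gap between 0.084 and 0.000244 is the whole point: the scheme only pays off when the dictionary entries are far more probable than chance.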
We do not simply want a coding scheme that performs slightly better than the simple-
minded approach of coding each pattern as equally likely; we would like to improve the
performance as much as possible. In order for this to happen, p should be as large as possible.
This means that we should carefully select patterns that are most likely to occur as entries in
the dictionary. To do this, we have to have a pretty good idea about the structure of the source
output. If we do not have information of this sort available to us prior to the encoding of a
particular source output, we need to acquire this information somehow when we are encoding.
If we feel we have sufficient prior knowledge, we can use a static approach; if not, we can take
an adaptive approach. We will look at both these approaches in this chapter.
5.3 Static Dictionary
Choosing a static dictionary technique is most appropriate when considerable prior knowl-
edge about the source is available. This technique is especially suitable for use in specific