encoded with a reference to the dictionary. If the pattern does not appear in the dictionary,
then it can be encoded using some other less efficient method. In effect, we are splitting the
input into two classes: frequently occurring patterns and infrequently occurring patterns. For
this technique to be effective, the class of frequently occurring patterns, and hence the size of
the dictionary, must be much smaller than the number of all possible patterns.
Suppose we have a particular text that consists of four-character words, three characters
from the 26 lowercase letters of the English alphabet followed by a punctuation mark. Sup-
pose our source alphabet consists of the 26 lowercase letters of the English alphabet and the
punctuation marks comma, period, exclamation mark, question mark, semicolon, and colon.
In other words, the size of the input alphabet is 32. If we were to encode the text source one
character at a time, treating each character as an equally likely event, we would need 5 bits per
character. Treating all 32 4
2 20
four-character patterns as equally likely, we
have a code that assigns 20 bits to each four-character pattern. Let us now put the 256 most
likely four-character patterns into a dictionary. The transmission scheme works as follows.
Whenever we want to send a pattern that exists in the dictionary, we will send a 1-bit flag, say,
a 0, followed by an 8-bit index corresponding to the entry in the dictionary. If the pattern is
not in the dictionary, we will send a 1 followed by the 20-bit encoding of the pattern. If the
pattern we encounter is not in the dictionary, we will actually use more bits than in the original
scheme, 21 instead of 20. But if it is in the dictionary, we will send only 9 bits. The utility of
our scheme will depend on the percentage of the words we encounter that are in the dictionary.
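The flag-plus-index scheme described above can be sketched in a few lines of code. This is a minimal illustration, not an implementation from the text: the 32-symbol alphabet ordering, the 5-bit character code, and the sample dictionary entries are all assumptions made for the example.

```python
# Sketch of the flag-plus-index dictionary scheme described above.
# The alphabet ordering and the example dictionary are illustrative assumptions.

ALPHABET = "abcdefghijklmnopqrstuvwxyz,.!?;:"  # 32 symbols -> 5 bits each
CODE = {ch: format(i, "05b") for i, ch in enumerate(ALPHABET)}

# A hypothetical dictionary holding up to 256 frequent four-character patterns.
dictionary = ["the.", "and,", "for.", "you?"]          # ... up to 256 entries
index_of = {pat: i for i, pat in enumerate(dictionary)}

def encode(pattern):
    """Encode one four-character pattern as a bit string."""
    if pattern in index_of:
        # In the dictionary: 1-bit flag '0' + 8-bit index = 9 bits.
        return "0" + format(index_of[pattern], "08b")
    # Not in the dictionary: flag '1' + four 5-bit characters = 21 bits.
    return "1" + "".join(CODE[ch] for ch in pattern)

print(len(encode("the.")))  # 9 bits for a dictionary pattern
print(len(encode("cat!")))  # 21 bits for a non-dictionary pattern
```

Note that a dictionary hit costs 9 bits while a miss costs 21, which is exactly the trade-off the next paragraph quantifies.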
We can get an idea about the utility of our scheme by calculating the average number of bits per
pattern. If the probability of encountering a pattern from the dictionary is p, then the average
number of bits per pattern R is given by

R = 9p + 21(1 − p) = 21 − 12p        (5.1)
For our scheme to be useful, R should have a value less than 20. This happens when p > 0.084.
This does not seem like a very large number. However, note that if all patterns were occurring
in an equally likely manner, the probability of encountering a pattern from the dictionary would
be less than 0.00025!
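The arithmetic behind these two numbers can be checked directly. The sketch below is an illustrative calculation, not code from the text; it evaluates Equation (5.1) and the equally-likely hit probability for a 256-entry dictionary.

```python
# Numerically checking Equation (5.1): R = 9p + 21(1 - p) = 21 - 12p.

def average_bits(p):
    """Average bits per pattern when a fraction p of patterns hit the dictionary."""
    return 9 * p + 21 * (1 - p)

# Break-even: R < 20 requires p > 1/12, i.e. roughly 0.084.
print(round(1 / 12, 3))          # 0.083
print(average_bits(0.084))       # just under 20 bits per pattern

# If all 32**4 patterns were equally likely, the chance of hitting
# a 256-entry dictionary would be only:
print(256 / 32**4)               # about 0.000244, i.e. less than 0.00025
```

The gap between 0.084 and 0.000244 is the whole point: the scheme only pays off when the dictionary entries are far more probable than chance.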
We do not simply want a coding scheme that performs slightly better than the simple-
minded approach of coding each pattern as equally likely; we would like to improve the
performance as much as possible. In order for this to happen, p should be as large as possible.
This means that we should carefully select patterns that are most likely to occur as entries in
the dictionary. To do this, we have to have a pretty good idea about the structure of the source
output. If we do not have information of this sort available to us prior to the encoding of a
particular source output, we need to acquire this information somehow when we are encoding.
If we feel we have sufficient prior knowledge, we can use a static approach; if not, we can take
an adaptive approach. We will look at both these approaches in this chapter.
5.3 Static Dictionary
Choosing a static dictionary technique is most appropriate when considerable prior knowl-
edge about the source is available. This technique is especially suitable for use in specific