Databases Reference
In-Depth Information
T A B L E 5 . 2
Thirty most frequently occurring
pairs of characters in a
41,364-character-long LaTeX
document.
Pair
Count
Pair
Count
eb
1128
ar
314
bt
838
at
313
bb
823
b w
309
th
817
te
296
he
712
bs
295
in
512
db
272
sb
494
bo
266
er
433
io
257
ba
425
co
256
tb
401
re
247
en
392
b $
246
on
385
rb
239
nb
353
di
230
ti
322
ic
229
bi
317
ct
226
Suppose we wish to encode the sequence
abracadabra
The encoder reads the first two characters ab and checks to see if this pair of letters exists in
the dictionary. It does and is encoded using the codeword 101. The encoder then reads the
next two characters ra and checks to see if this pair occurs in the dictionary. It does not, so the
encoder sends out the code for r , which is 100, then reads in one more character, c , to make the
two-character pattern ac . This does exist in the dictionary and is encoded as 110. Continuing
in this fashion, the remainder of the sequence is coded. The output string for the given input
sequence is 101100110111101100000.
A list of the 30 most frequently occurring pairs of characters in an earlier version of this
chapter is shown in Table 5.2 . For comparison, the 30 most frequently occurring pairs of
characters in a set of C programs is shown in Table 5.3 .
In these tables,
b corresponds to a space and nl corresponds to a new line. Notice how
different the two tables are. It is easy to see that a dictionary designed for compressing LaTeX
documents would not work very well when compressing C programs. However, we generally
want techniques that will be able to compress a variety of source outputs. If we want to
compress computer files, we do not want to change techniques based on the content of the file.
Rather, we would like the technique to adapt to the characteristics of the source output. We
discuss adaptive-dictionary-based techniques in the next section.
/
 
Search WWH ::




Custom Search