When the number of entries in the dictionary exceeds a prearranged threshold C3, the
encoder sends the STEPUP control code, and the codeword size is incremented by 1 bit. At
the same time, the threshold C3 is also doubled. When all available dictionary entries are filled,
the algorithm initiates a reuse procedure. The location of the first string entry in the dictionary
is maintained in a variable N5. Starting from N5, a counter C1 is incremented until it finds a
dictionary entry that is not a prefix to any other dictionary entry. The fact that this entry is not
a prefix to another dictionary entry means that this pattern has not been encountered since it
was created. Furthermore, because of the way it was located, this pattern has been around the
longest among patterns of this kind. This reuse procedure enables the algorithm to continually
prune the dictionary of strings that may have been encountered in the past but have not been
encountered recently. In this way, the dictionary is always matched to the current
source statistics.
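As a rough illustration of this maintenance logic, the sketch below shows both the STEPUP step and the prefix test used by the reuse procedure. It is a minimal sketch in Python, not the recommendation's actual pseudocode: apart from C3, N5, and C1, the names are assumptions of mine, and the real scan also wraps around the string-entry region, which is omitted here.

```python
# Minimal sketch of the V.42 bis dictionary-maintenance logic described
# above. Apart from C3, N5, and C1, all names are illustrative; the real
# reuse scan also wraps around the string-entry region (omitted here).

def maybe_stepup(num_entries, codeword_size, c3, send):
    """When the number of entries exceeds the threshold C3, send the
    STEPUP control code, widen the codeword by 1 bit, and double C3."""
    if num_entries > c3:
        send("STEPUP")
        codeword_size += 1
        c3 *= 2
    return codeword_size, c3

def find_reusable_entry(entries, n5):
    """Scan from N5 (the first string entry) for an entry that is not a
    prefix of any other entry; such an entry has not been matched since
    it was created, and scanning in order finds the oldest one."""
    for c1 in range(n5, len(entries)):
        candidate = entries[c1]
        if not any(other != candidate and other.startswith(candidate)
                   for other in entries):
            return c1            # this dictionary slot can be reused
    return None
```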
To reduce the effect of errors, the CCITT recommends setting a maximum string length.
This maximum length is negotiated at link setup. The CCITT recommends a range of 6-250,
with a default value of 6.
The V.42 bis recommendation avoids the need for an exception handler for the case where
the decoder receives a codeword corresponding to an incomplete entry by forbidding the use
of the last entry in the dictionary. Instead of transmitting the codeword corresponding to
the last entry, the recommendation requires the sending of the codewords corresponding to
the constituents of the last entry. In the example used to demonstrate this quirk of the LZW
algorithm, the V.42 bis recommendation would have forced us to send the codewords 3 and 1
instead of transmitting the codeword 5.
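The following is a sketch of how an encoder might honor this rule, assuming a dictionary that maps strings to integer codewords; the function name and arguments are illustrative, not taken from the recommendation. Because every multi-character LZW entry is an existing entry extended by one character, the two constituents, the prefix entry and the final character, are guaranteed to have codewords of their own.

```python
# Sketch of the "forbidden last entry" rule; names are illustrative.
# dictionary maps strings to integer codewords, and last_entry is the
# string entry created most recently.

def emit_codeword(match, dictionary, last_entry, send):
    if match == last_entry:
        # Send the codewords of the entry's constituents instead: its
        # prefix (itself an entry, since LZW builds entries one character
        # at a time) and its final character. In the example referred to
        # above, this means sending codewords 3 and 1 rather than 5.
        send(dictionary[match[:-1]])
        send(dictionary[match[-1]])
    else:
        send(dictionary[match])
```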
5.6 Beyond Compression: Lempel-Ziv Complexity
In Chapter 2, we described the notion of Kolmogorov complexity, noting that the Kolmogorov
complexity of a sequence was not something that we could necessarily compute. Lempel and
Ziv provided a definition of complexity in 1976 [57] that is computable. This notion views
complexity as the level of diversity in a sequence. Thus, a sequence that consists of repetitions
of the same letter would not be a complex sequence, while a sequence that appears random
would be complex. The development of this definition of complexity formed the basis for
the two compression approaches that were to follow in the next two years. Since then, this
definition of complexity has had an impact in areas other than compression. We will take a brief
detour into these applications as they confirm our assertion that data compression algorithms
are ways of modeling and understanding how information is organized in data.
The computation of Lempel-Ziv (LZ) complexity is performed in a manner very similar
to the LZ77 algorithm. The basic idea is to parse the input sequence into phrases that do not
exist in the past of the sequence. Thus, a phrase is built up by copying from the past of the
sequence until we add a letter that makes the phrase distinct from anything in the past of the
sequence. Let's see how this works by using a familiar sequence, cabracadabrarrarrad. We
begin with an empty history, so the first letter c is a unique phrase, as are the letters a, b, and r.
The next letter in the sequence is a, which has occurred in the past, so we add the next letter,
c, to form the new phrase ac.
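The sketch below implements this parsing in Python (the function name is mine, not from the text). Definitions in the literature differ on small points, such as whether the copy source may overlap the phrase being built; this version allows overlap, in the spirit of LZ77, and the number of resulting phrases is the complexity.

```python
def lz_complexity(s):
    """Parse s into phrases that have not occurred in the past of the
    sequence; the phrase count is the Lempel-Ziv (1976) complexity."""
    phrases = []
    i = 0                              # start of the current phrase
    while i < len(s):
        length = 1
        # Grow the phrase while s[i:i+length] can still be found starting
        # before position i (the copy may overlap the phrase itself).
        while (i + length <= len(s)
               and s.find(s[i:i + length], 0, i + length - 1) != -1):
            length += 1
        phrases.append(s[i:i + length])
        i += length
    return phrases

phrases = lz_complexity("cabracadabrarrarrad")
print(len(phrases), phrases)
# 8 ['c', 'a', 'b', 'r', 'ac', 'ad', 'abrar', 'rarrad']
```

Under these assumptions the example sequence parses into eight phrases, beginning with the single letters c, a, b, and r and the two-letter phrases ac and ad built up as described above.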