algorithm moved through the “j” part of the dictionary, the node
containing “j” would be pushed repeatedly to the top of the splay tree
where it would get a short code word. Later, when the compression
function got to the “z” section, the node for “z” would end up near the
top, consistently giving “z” a short code word. Obviously one major
problem with this compression scheme is that the entire file must
be processed from the beginning to keep an accurate description of
the splay tree. You can't simply jump to the “z” section and begin
decompressing.
A good basic reference on compression is [Sto88].
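The payoff of pushing recently used symbols toward short codes can be illustrated with a deliberately simplified stand-in: a move-to-front coder. This is not the splay-tree algorithm itself, and the function name and alphabet are my own choices, but it shows the same locality effect: once a symbol has been used, repeating it costs the minimum possible index.

```python
# A simplified illustration of the locality idea behind the splay-tree
# coder: a move-to-front list. Recently used symbols sit near the front
# and get small indices, which a back-end coder would turn into short
# code words. (Sketch only -- not the splay-tree algorithm itself.)
import string

def mtf_encode(text, alphabet=string.ascii_lowercase):
    table = list(alphabet)
    out = []
    for ch in text:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))  # push the symbol to the front
    return out

print(mtf_encode("jjjz"))  # → [9, 0, 0, 25]: after the first "j", repeats cost 0
```

Just as with the splay tree, a run of “j”s quickly becomes cheap, and the coder must process the file from the beginning to keep the table in sync.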
This basic Huffman algorithm has many different uses. It will be used
in Chapter 6 to turn data into something that looks like English text.
Huffman encoding is also used as a building block in Chapter 7 to
make optimal weighted choices between different words. The same
structure is as useful there as it is here.
5.3.2 Dictionary Compression
Compression schemes like the popular and patented Lempel-Ziv al-
gorithm are called dictionary schemes because they build a big list
of common words in the file.² This list can either be created in one
swoop at the beginning of the compression or it could be built and
changed adaptively as the algorithm processes the file. The algo-
rithms succeed because a pointer describing the position in the dic-
tionary takes up much less space than the common word itself.
The dictionary is just a list of words. It is almost always 2^n words
because that makes the pointer to a particular word take up n bits.
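A quick check of that arithmetic (the function name here is my own):

```python
# A 2**n-word dictionary needs exactly n-bit pointers, no matter how
# long the individual words in the dictionary are.
import math

def pointer_bits(dictionary_size):
    return math.ceil(math.log2(dictionary_size))

print(pointer_bits(4096))  # a 4096-word dictionary → 12-bit pointers
```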
Each word can either be a fixed length or a flexible length. Fixed
lengths are easier to handle, but flexible lengths do a better job of
approximating the English language and x86 machine code.
Compression follows easily. First, the file is analyzed to create a
list of the 2^n most common words. Then the file is processed by
scanning from beginning to end. If the current word is in the dictionary,
then it is replaced by a tag, <InDict> , followed by the position in the
dictionary. If it isn't in the dictionary, then it is replaced by a tag,
<Verbatim> , followed by the word that remains unchanged.
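The procedure can be sketched in a few lines. The token layout below, with Python tuples standing in for the <InDict> and <Verbatim> tags, is an illustrative choice rather than a real file format:

```python
# A toy version of the dictionary scheme: the 2**n most common words
# get dictionary slots; everything else passes through verbatim.
from collections import Counter

def build_dictionary(words, n):
    """Keep the 2**n most common words."""
    return [w for w, _ in Counter(words).most_common(2 ** n)]

def compress(words, dictionary):
    index = {w: i for i, w in enumerate(dictionary)}
    out = []
    for w in words:
        if w in index:
            out.append(("InDict", index[w]))   # pointer into the dictionary
        else:
            out.append(("Verbatim", w))        # word left unchanged
    return out

def decompress(tokens, dictionary):
    return [dictionary[x] if tag == "InDict" else x for tag, x in tokens]

words = "the cat sat on the mat the cat".split()
d = build_dictionary(words, 2)                 # at most 2**2 = 4 entries
assert decompress(compress(words, d), d) == words
```

Common words like “the” compress to a small dictionary index, while a rare word like “mat” is carried verbatim at full cost plus the tag overhead.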
Obviously, the success of the algorithm depends on the size of the
tags ( <InDict> and <Verbatim> ), the size of the dictionary, and the
number of times something is found in the dictionary. One basic and
usually effective solution is to make the tags be one entire byte, B. If
the value of the byte is zero, then the next n bits represent a word in
² The algorithms are not particularly good at compressing files like dictionaries
used by humans. The fact that I used a regular dictionary as an example in the previous
section is just a coincidence. Don't be confused.