so it would be statistically correct, but it did not explain how to hide
information in the process. Nor did it explain how to run Huffman
compression in reverse.
The information is hidden by letting the data to be concealed
dictate the choice of the next letter. In the example described above,
either “a”, “e”, or “o” could follow the starting letters “The l”. It is easy
to come up with a simple scheme for encoding information. If “a”
stands for “1”, “e” stands for “2” and “o” stands for “3”, then common
numbers could be encoded in the choice of the letters. Someone at
a distance could recover this value if they had a copy of the same
source text, S, that generated the table of statistics. They could look
up “The l” and discover that there are three letters that follow “he l”
in the table. The letter “e” is the second choice in alphabetical order,
so the letter “e” stands for the message “2”.
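As a sketch, this simple scheme is easy to express in a few lines of
Python. The table of statistics here is a toy stand-in for one built
from a real source text, and the names are illustrative only:

    # Letters seen to follow the prefix "The l", a toy stand-in for
    # a table built from some shared source text S.
    TABLE = {"The l": ["a", "e", "o"]}

    def encode_digit(prefix, digit):
        # Hide a small number by picking the digit-th letter,
        # counting in alphabetical order ("a" = 1, "e" = 2, "o" = 3).
        return sorted(TABLE[prefix])[digit - 1]

    def decode_digit(prefix, letter):
        # Recover the number from the letter's alphabetical position.
        return sorted(TABLE[prefix]).index(letter) + 1

    letter = encode_digit("The l", 2)                # picks "e"
    print(letter, decode_digit("The l", letter))     # e 2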
A long text like the one shown in Figure 6.1 could hide a different
number in each letter. If there were no choice about the next letter
to be added to the output, though, then no information could be
hidden. That letter would not hide anything.
Simply using a letter to encode a number is not an efficient or a
flexible way to send data. What if you wanted to send the message
“4” and there were only three choices? What if you wanted to send
a long picture? What if you wanted to send the value “1”, but the
first letter was the least common choice? Would this destroy the
statistical composition?
Running Huffman codes in reverse is the solution to all of these
problems. Figure 6.2 shows a simple Huffman tree constructed from
the three choices of letters to follow “The l”. The tree was constructed
using the statistics that showed that the letter “e” followed in 16 out of
the 20 times while the letters “a” and “o” both followed twice apiece.
Messages are encoded with a Huffman tree like this with a vari-
able number of bits. The choice of “e” encodes the bit “0”; the choice
of “a” encodes “10”; and the choice of “o” encodes the message “11”.
These bits can be recovered at the other end by reversing this choice.
The number of bits that are hidden with each choice of a letter varies
directly with the number of choices that are possible and the proba-
bilities that govern the choice.
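A minimal Python sketch of this reversal follows. The code table is
hard-coded to match the tree in Figure 6.2 rather than rebuilt from the
statistics, and the function names are illustrative; a real system
would construct a fresh tree for each prefix:

    CODES = {"e": "0", "a": "10", "o": "11"}    # letter -> codeword
    DECODE = {bits: ltr for ltr, bits in CODES.items()}

    def hide(bits):
        # Consume the hidden bits, emitting a letter each time a
        # complete codeword has been read off.
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in DECODE:
                out.append(DECODE[buf])
                buf = ""
        return "".join(out)     # leftover bits would need padding

    def recover(letters):
        # Reverse each choice: every letter yields its codeword back.
        return "".join(CODES[ltr] for ltr in letters)

    stego = hide("0100011")                     # -> "eaeeo"
    print(stego, recover(stego) == "0100011")   # eaeeo True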
There should generally be more than three choices available if the
source text S is large enough to offer some variation, but there will
rarely be a full 26 choices. This is only natural because English has
plenty of redundancy built into the language. Shannon recognized
this when he set up information theory. If the average entropy of En-
glish is about 3 bits per character, then this means that there should
only be about 2³ or eight choices that can be made for the next
character.
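These figures are easy to check with a short calculation, sketched
here with the example counts from Figure 6.2 (16, 2, and 2 out of 20):

    from math import log2

    probs = [16/20, 2/20, 2/20]
    entropy = -sum(p * log2(p) for p in probs)
    print(round(entropy, 3))   # about 0.922 bits hidden per letter
    print(2 ** 3)              # 3 bits/char of entropy -> 8 choices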