We can determine the expected savings from Huffman coding if the actual frequencies of a
coded message match the expected frequencies.
Example 5.11 Because the sum of the frequencies in Figure 5.31 is 306
and E has frequency 120, we expect it to appear 120 times in a message
containing 306 letters. An actual message might or might not meet this
expectation. Letters D, L, and U have code lengths of three, and together
are expected to appear 121 times in 306 letters. Letter C has a code length of
four, and is expected to appear 32 times in 306 letters. Letter M has a code
length of five, and is expected to appear 24 times in 306 letters. Finally,
letters K and Z have code lengths of six, and together are expected to appear
only 9 times in 306 letters. The average expected cost per character is
simply the sum of the cost for each character ($c_i$) times the probability of
its occurring ($p_i$), or
$$c_1 p_1 + c_2 p_2 + \cdots + c_n p_n.$$
This can be reorganized as
$$\frac{c_1 f_1 + c_2 f_2 + \cdots + c_n f_n}{f_T},$$
where $f_i$ is the (relative) frequency of letter $i$ and $f_T$ is the total for all letter
frequencies. For this set of frequencies, the expected cost per letter is
$$[(1 \cdot 120) + (3 \cdot 121) + (4 \cdot 32) + (5 \cdot 24) + (6 \cdot 9)]/306 = 785/306 \approx 2.57.$$
A fixed-length code for these eight characters would require log 8 = 3 bits
per letter as opposed to about 2.57 bits per letter for Huffman coding. Thus,
Huffman coding is expected to save about 14% for this set of letters.
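As a check on this arithmetic, the short sketch below (not part of the text; the class name and the grouping of letters by code length are only for illustration) recomputes the 785/306, or about 2.57 bits per letter, and the roughly 14% savings from the counts quoted above.

public class HuffmanExpectedCost {
    public static void main(String[] args) {
        // Code lengths and expected counts taken from Example 5.11:
        // E appears 120 times with a 1-bit code; D, L, U together 121 times
        // with 3-bit codes; C 32 times with a 4-bit code; M 24 times with a
        // 5-bit code; K, Z together 9 times with 6-bit codes.
        int[] codeLengths = {1, 3, 4, 5, 6};
        int[] counts      = {120, 121, 32, 24, 9};

        int totalLetters = 0;   // f_T, the total of all letter frequencies
        int totalBits    = 0;   // the sum of c_i * f_i
        for (int i = 0; i < codeLengths.length; i++) {
            totalLetters += counts[i];
            totalBits    += codeLengths[i] * counts[i];
        }

        double expected = (double) totalBits / totalLetters;   // 785/306, about 2.57
        double savings  = 1.0 - expected / 3.0;                // versus a 3-bit fixed-length code

        System.out.printf("Expected cost: %d/%d = %.2f bits per letter%n",
                          totalBits, totalLetters, expected);
        System.out.printf("Savings over fixed-length coding: about %.0f%%%n",
                          savings * 100);
    }
}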
Huffman coding for all ASCII symbols should do better than this. The letters of
Figure 5.31 are atypical in that there are too many common letters compared to the
number of rare letters. Huffman coding for all 26 letters would yield an expected
cost of 4.29 bits per letter. The equivalent fixed-length code would require about
five bits. This is somewhat unfair to fixed-length coding because there is actually
room for 32 codes in five bits, but only 26 letters. More generally, Huffman coding
of a typical text file will save around 40% over ASCII coding if we charge ASCII
coding at eight bits per character. Huffman coding for a binary file (such as a
compiled executable) would have a very different set of distribution frequencies and
so would have a different space savings. Most commercial compression programs
use two or three coding schemes to adjust to different types of files.
In the preceding example, “DEED” was coded in 8 bits (1-bit codes for the two Es and
3-bit codes for the two Ds), a saving of 33% over the 12 bits required by a fixed-length
coding. However, “MUCK” requires 18 bits (5 + 3 + 4 + 6), more than the 12 bits of the
fixed-length code.
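Both bit counts can be reproduced directly from the code lengths stated in Example 5.11. The following sketch (an illustrative helper, not the book's code) totals the Huffman bits for a word and compares them with a 3-bit fixed-length encoding.

import java.util.Map;

public class MessageBitLength {
    // Code lengths from Example 5.11: E = 1; D, L, U = 3; C = 4; M = 5; K, Z = 6.
    private static final Map<Character, Integer> CODE_LENGTH = Map.of(
            'E', 1, 'D', 3, 'L', 3, 'U', 3, 'C', 4, 'M', 5, 'K', 6, 'Z', 6);

    static int huffmanBits(String word) {
        int bits = 0;
        for (char c : word.toCharArray()) {
            bits += CODE_LENGTH.get(c);   // length of this letter's Huffman code
        }
        return bits;
    }

    public static void main(String[] args) {
        // Prints 8 vs. 12 bits for DEED and 18 vs. 12 bits for MUCK.
        for (String word : new String[] {"DEED", "MUCK"}) {
            System.out.printf("%s: Huffman %d bits, fixed-length %d bits%n",
                              word, huffmanBits(word), 3 * word.length());
        }
    }
}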