Huffman Coding - Introduction to Data Compression

Databases Reference

In-Depth Information

3.3 Nonbinary Huffman Codes

The binary Huffman coding procedure can be easily extended to the nonbinary case where

the code elements come from an m -ary alphabet, and m is not equal to two. Recall that we

obtained the Huffman algorithm based on the following observations about an optimum binary

prefix code:

1. Symbols that occur more frequently (have a higher probability of occurrence) will have

shorter codewords than symbols that occur less frequently.

2. The two symbols that occur least frequently will have the same length.

Also recall that an additional requirement is that the two symbols with the lowest probability

differ only in the last position.

We can obtain a nonbinary Huffman code in almost exactly the same way. The obvious

thing to do would be to modify the second observation to read “The m symbols that occur least

frequently will have the same length,” and also modify the additional requirement to read “The

m symbols with the lowest probability differ only in the last position.”

However, we run into a small problem with this approach. Consider the design of a ternary

Huffman code for a source with a six-letter alphabet. Using the rules described above, we

would first combine the three letters with the lowest probability into a composite letter. This

would give us a reduced alphabet with four letters. However, combining the three letters with

lowest probability from this alphabet would result in a further reduced alphabet consisting of

only two letters. We have three values to assign and only two letters. Instead of combining

three letters at the beginning, we could have combined two letters. This would result in a

reduced alphabet of size five. If we combined three letters from this alphabet, we would

end up with a final reduced alphabet size of three. Finally, we could combine two letters in

the second step, which would again result in a final reduced alphabet of size three. Which

alternative should we choose?

Recall that the symbols with the lowest probability will have the longest codeword. Fur-

thermore, all of the symbols that we combine together into a composite symbol will have

codewords of the same length. This means that all letters we combine together at the very first

stage will have codewords that have the same length, and these codewords will be the longest

of all the codewords. If at some stage we are allowed to combine less than m symbols, the

logical place to do this would be in the very first stage.

In the general case of an m -ary code and an M -letter alphabet, how many letters should

we combine in the first phase? Let m be the number of letters that are combined in the first

phase. Then m is the number between two and m , such that m modulo

(

−

)

is equal to M

modulo

(

−

)

Example3.3.1:

Generate a ternary Huffman code for a source with a six-letter alphabet and a probability model

(

a 1 ) =

(

a 3 ) =

(

a 4 ) =

(

a 5 ) =

(

a 6 ) =

(

a 2 ) =

1, and P

05. In this case

3, therefore m is either 2 or 3.

(

mod 2

) =

(

mod 2

) =

(

mod 2

) =

Introduction to Data Compression

Search WWH ::

Custom Search

Home