FIGURE 3.8 Tree for the canonical Huffman code.
We illustrate this procedure using the example we have been considering in this section. Our alphabet consists of five letters {a_1, a_2, a_3, a_4, a_5}. The lengths of the corresponding Huffman codewords are {2, 1, 3, 4, 4}. We generate the codewords from the shortest to the longest, keeping in mind the constraint that shorter codewords lexicographically precede longer codewords. The shortest codeword is the one assigned to a_2. The lexicographically smallest codeword of length one is 0, so the codeword for a_2 is 0. The codewords of length two have to be of the form 1x. Only one codeword of length two is required, namely the codeword for a_1, so we assign the codeword 10 to a_1. The codewords of length three now have to be of the form 11x. We need only a single codeword of length three, the codeword for a_3; therefore, the codeword for a_3 is 110. There are two codewords of length four, those for a_4 and a_5. Therefore, the codeword for a_4 is 1110 and the codeword for a_5 is 1111. This code can now be transmitted by simply sending the lengths. Unlike the previous code, we know exactly which codeword belongs to which letter in the alphabet. This is because of the lexicographic constraints. As can be seen from the tree for this code in Figure 3.8, the tree increases in depth as we move from left to right. We know from the sequence of lengths that the shortest codeword is for the letter a_2, the next shortest codeword is for the letter a_1, and so on. Thus, simply encoding the list of lengths is sufficient for storing the Huffman code. For an alphabet of size five this is not a significant saving; however, when the alphabet size is on the order of 256, the savings can be very significant.
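The construction just described is mechanical enough to sketch in code. The following Python fragment is a minimal sketch, not taken from the text; the function name canonical_code and the symbol-numbering convention are our own. It rebuilds the codewords of this example from the list of lengths alone, which is exactly what a decoder that receives only the lengths would have to do.

def canonical_code(lengths):
    # Build canonical Huffman codewords from a list of codeword lengths.
    # lengths[i] is the length assigned to symbol i; returns a dict
    # mapping symbol index -> codeword string. Shorter codewords
    # lexicographically precede longer ones, as required in the text.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes = {}
    code = 0
    prev_len = lengths[order[0]]
    for sym in order:
        # Appending zeros (a left shift) moves to the next longer length.
        code <<= lengths[sym] - prev_len
        codes[sym] = format(code, "0{}b".format(lengths[sym]))
        prev_len = lengths[sym]
        code += 1
    return codes

# Example from the text: lengths {2, 1, 3, 4, 4} for a_1, ..., a_5.
for sym, word in sorted(canonical_code([2, 1, 3, 4, 4]).items()):
    print("a_%d: %s" % (sym + 1, word))
# Prints: a_1: 10, a_2: 0, a_3: 110, a_4: 1110, a_5: 1111

Ties between equal lengths are broken by symbol index, which matches the convention above that the codeword for a_4 precedes the codeword for a_5.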
3.2.3 Length-Limited Huffman Codes
The Huffman coding algorithm tries to minimize the average length of codewords. However,
there are no limits on the maximum length of an individual codeword. This is not necessarily a
problem when dealing with limited alphabet sizes. However, if the alphabet size is relatively
large, there is a possibility of getting very long Huffman codes. This could result in codewords
that do not fit in a single machine word. This in turn can lead to inefficiencies in the coding
process. These problems can be avoided by adding an additional constraint to the variable-length code design problem: a requirement that all codewords be less than or equal to some maximum codeword length l_max. If m is the size of the source alphabet, then clearly l_max has to be greater than or equal to ⌈log_2 m⌉ for the code to be a valid code.
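As a rough illustration (the helper names below are ours, not the book's), the smallest admissible value of l_max follows from the fact that there are only 2^l_max distinct binary strings of length l_max; the Kraft inequality, which any prefix code must satisfy, gives the same bound. A minimal Python sketch:

import math

def smallest_l_max(m):
    # There are only 2**l distinct binary strings of length l, so m
    # codewords of length at most l_max require 2**l_max >= m,
    # i.e., l_max >= ceil(log2(m)).
    return math.ceil(math.log2(m))

def kraft_ok(lengths):
    # A prefix code with these codeword lengths exists if and only if
    # sum(2**-l) <= 1 (Kraft inequality).
    return sum(2.0 ** -l for l in lengths) <= 1.0

m = 256
l_max = smallest_l_max(m)             # 8 for a 256-letter alphabet
print(l_max, kraft_ok([l_max] * m))   # 8 True
print(kraft_ok([l_max - 1] * m))      # False: 7-bit codewords cannot cover 256 letters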