Arithmetic Coding - Introduction to Data Compression

Databases Reference

In-Depth Information

T A B L E 4 . 2

Huffman code for extended

alphabet.

Letter

Probability

Code

a 1 a 1

0.9025

0

a 1 a 2

0.0190

111

a 1 a 3

0.0285

100

a 2 a 1

0.0190

1101

a 2 a 2

0.0004

110011

a 2 a 3

0.0006

110001

a 3 a 1

0.0285

101

a 3 a 2

0.0006

110010

a 3 a 3

0.0009

110000

sequences without having to generate codes for all sequences of that length. The arithmetic

coding technique fulfills this requirement.

In arithmetic coding, a unique identifier or tag is generated for the sequence to be encoded.

This tag corresponds to a binary fraction, which becomes the binary code for the sequence.

In practice, the generation of the tag and the binary code are the same process. However, the

arithmetic coding approach is easier to understand if we conceptually divide the approach into

two phases. In the first phase, a unique identifier or tag is generated for a given sequence

of symbols. This tag is then given a unique binary code. A unique arithmetic code can

be generated for a sequence of length m without the need for generating codewords for all

sequences of length m . This is unlike the situation for Huffman codes. In order to generate

a Huffman code for a sequence of length m , where the code is not a concatenation of the

codewords for the individual symbols, we need to obtain the Huffman codes for all sequences

of length m .

4.3 Coding a Sequence

In order to distinguish a sequence of symbols from another sequence of symbols we need to

tag it with a unique identifier. One possible set of tags for representing sequences of symbols

are the numbers in the unit interval [0,1). Because the number of numbers in the unit interval

is infinite, it should be possible to assign a unique tag to each distinct sequence of symbols. In

order to do this, we need a function that will map sequences of symbols into the unit interval.

A function that maps random variables, and sequences of random variables, into the unit

interval is the cumulative distribution function ( cdf ) of the random variable associated with

the source. This is the function we will use in developing the arithmetic code. (If you are not

familiar with random variables and cumulative distribution functions or need to refresh your

memory, you may wish to look at Appendix A.)

The use of the cumulative distribution function to generate a binary code for a sequence has a

rather interesting history. Shannon, in his original 1948 paper [ 3 ], mentioned an approach using

Introduction to Data Compression

Search WWH ::

Custom Search

Home