2.3 Models
As we saw in Section 2.2, having a good model for the data can be useful in estimating the
entropy of the source. As we will see in later chapters, good models for sources lead to more
efficient compression algorithms. In general, in order to develop techniques that manipulate
data using mathematical operations, we need to have a mathematical model for the data.
Obviously, the better the model (i.e., the closer the model matches the aspects of reality that
are of interest to us), the more likely it is that we will come up with a satisfactory technique.
There are several approaches to building mathematical models.
2.3.1 Physical Models
If we know something about the physics of the data generation process, we can use that
information to construct a model. For example, in speech-related applications, knowledge
about the physics of speech production can be used to construct a mathematical model for
the sampled speech process. Sampled speech can then be encoded using this model. We will
discuss speech production models in more detail in Chapter 8 and Chapter 18.
Models for certain telemetry data can also be obtained through knowledge of the underlying
process. For example, if residential electrical meter readings at hourly intervals were to be
coded, knowledge about the living habits of the populace could be used to determine when
electricity usage would be high and when the usage would be low. Then instead of the actual
readings, the difference (residual) between the actual readings and those predicted by the model
could be coded.
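As a minimal sketch of this residual idea in Python (the 24-hour usage profile and the meter readings below are invented for illustration, standing in for predictions that would actually come from knowledge of the consumers' living habits), only the small differences would need to be coded:

```python
# Residual (predictive) coding sketch for hourly electricity readings.
# Readings and the model profile are in integer watt-hours; both are hypothetical.
hourly_profile = [400] * 6 + [1200] * 2 + [800] * 10 + [1500] * 4 + [600] * 2  # 24 hours

def to_residuals(readings):
    """Difference between each actual reading and the model's prediction for that hour."""
    return [r - hourly_profile[h % 24] for h, r in enumerate(readings)]

def from_residuals(residuals):
    """Reconstruct the original readings by adding the predictions back."""
    return [e + hourly_profile[h % 24] for h, e in enumerate(residuals)]

readings = [450, 380, 410, 390, 420, 500, 1300, 1150]   # first 8 hours of a day
residuals = to_residuals(readings)                       # small values, cheaper to code
assert from_residuals(residuals) == readings
```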
In general, however, the physics of data generation is simply too complicated to understand,
let alone use to develop a model. In such cases, we can instead obtain a model based on
empirical observation of the statistics of the data.
2.3.2 Probability Models
The simplest statistical model for the source is to assume that each letter that is generated by the
source is independent of every other letter, and each occurs with the same probability. We could
call this the ignorance model, as it would generally be useful only when we know nothing about
the source. (Of course, that really might be true, in which case we have a rather unfortunate
name for the model!) The next step up in complexity is to keep the independence assumption,
but remove the equal probability assumption and assign a probability of occurrence to each letter
in the alphabet. For a source that generates letters from an alphabet A = {a_1, a_2, ..., a_M},
we can have a probability model P = {P(a_1), P(a_2), ..., P(a_M)}.
Given a probability model (and the independence assumption), we can compute the entropy
of the source using Equation (4). As we will see in the following chapters, using the probability
model we can also construct some very efficient codes to represent the letters in A. Of course,
these codes are only efficient if our mathematical assumptions are in accord with reality.
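As a concrete illustration, the sketch below computes the first-order entropy
-∑ P(a_i) log2 P(a_i) of an independent source from its probability model; the four-letter
alphabet and its probabilities are made up for the example:

```python
import math

def entropy(prob_model):
    """First-order entropy, in bits per letter, of an independent source
    described by a mapping from letters to probabilities."""
    return -sum(p * math.log2(p) for p in prob_model.values() if p > 0)

# Hypothetical four-letter alphabet with unequal probabilities.
P = {"a1": 0.5, "a2": 0.25, "a3": 0.125, "a4": 0.125}
print(entropy(P))                        # 1.75 bits/letter
print(entropy({s: 0.25 for s in P}))     # 2.0 bits/letter for the equal-probability case
```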
If the assumption of independence does not fit with our observation of the data, we can
generally find better compression schemes if we discard this assumption. When we discard
the independence assumption, we have to come up with a way to describe the dependence of
elements of the data sequence on each other.
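One simple way to capture such dependence, sketched below under the assumption that each
element depends only on the element immediately preceding it, is to estimate conditional
probabilities from the observed data; the sample string is purely illustrative:

```python
from collections import Counter, defaultdict

def conditional_model(sequence):
    """Estimate P(current letter | previous letter) from observed data,
    one way of describing dependence between neighbouring elements."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(sequence, sequence[1:]):
        pair_counts[prev][cur] += 1
    return {prev: {cur: n / sum(counts.values()) for cur, n in counts.items()}
            for prev, counts in pair_counts.items()}

# In English-like text, 'u' is far more likely after 'q' than its overall frequency suggests.
model = conditional_model("the quick brown fox jumps over the lazy dog quite quickly")
print(model.get("q"))   # {'u': 1.0} in this toy sample
```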