2.3 Models
As we saw in Section 2.2, having a good model for the data can be useful in estimating the
entropy of the source. As we will see in later chapters, good models for sources lead to more
efficient compression algorithms. In general, in order to develop techniques that manipulate
data using mathematical operations, we need to have a mathematical model for the data.
Obviously, the better the model (i.e., the closer the model matches the aspects of reality that
are of interest to us), the more likely it is that we will come up with a satisfactory technique.
There are several approaches to building mathematical models.
2.3.1 Physical Models
If we know something about the physics of the data generation process, we can use that
information to construct a model. For example, in speech-related applications, knowledge
about the physics of speech production can be used to construct a mathematical model for
the sampled speech process. Sampled speech can then be encoded using this model. We will
discuss speech production models in more detail in Chapter 8 and Chapter 18.
Models for certain telemetry data can also be obtained through knowledge of the underlying
process. For example, if residential electrical meter readings at hourly intervals were to be
coded, knowledge about the living habits of the populace could be used to determine when
electricity usage would be high and when the usage would be low. Then instead of the actual
readings, the difference (residual) between the actual readings and those predicted by the model
could be coded.
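As a minimal sketch of this residual idea in Python (the 24-hour usage profile and the meter readings below are invented for illustration, standing in for predictions that would actually come from knowledge of the consumers' living habits), only the small differences would need to be coded:

```python
# Residual (predictive) coding sketch for hourly electricity readings.
# Readings and the model profile are in integer watt-hours; both are hypothetical.
hourly_profile = [400] * 6 + [1200] * 2 + [800] * 10 + [1500] * 4 + [600] * 2  # 24 hours

def to_residuals(readings):
    """Difference between each actual reading and the model's prediction for that hour."""
    return [r - hourly_profile[h % 24] for h, r in enumerate(readings)]

def from_residuals(residuals):
    """Reconstruct the original readings by adding the predictions back."""
    return [e + hourly_profile[h % 24] for h, e in enumerate(residuals)]

readings = [450, 380, 410, 390, 420, 500, 1300, 1150]   # first 8 hours of a day
residuals = to_residuals(readings)                       # small values, cheaper to code
assert from_residuals(residuals) == readings
```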
In general, however, the physics of data generation is simply too complicated to understand,
let alone use to develop a model. In such cases, we can instead obtain a model based on
empirical observation of the statistics of the data.
2.3.2 Probability Models
The simplest statistical model for the source is to assume that each letter that is generated by the
source is independent of every other letter, and each occurs with the same probability. We could
call this the ignorance model, as it would generally be useful only when we know nothing about
the source. (Of course, that really might be true, in which case we have a rather unfortunate
name for the model!) The next step up in complexity is to keep the independence assumption,
but remove the equal probability assumption and assign a probability of occurrence to each letter
in the alphabet. For a source that generates letters from an alphabet A = {a_1, a_2, ..., a_M},
we can have a probability model P = {P(a_1), P(a_2), ..., P(a_M)}.
Given a probability model (and the independence assumption), we can compute the entropy
of the source using Equation (4). As we will see in the following chapters, using the probability
model we can also construct some very efficient codes to represent the letters in A. Of course,
these codes are only efficient if our mathematical assumptions are in accord with reality.
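As a concrete illustration, the sketch below computes the first-order entropy
-∑ P(a_i) log2 P(a_i) of an independent source from its probability model; the four-letter
alphabet and its probabilities are made up for the example:

```python
import math

def entropy(prob_model):
    """First-order entropy, in bits per letter, of an independent source
    described by a mapping from letters to probabilities."""
    return -sum(p * math.log2(p) for p in prob_model.values() if p > 0)

# Hypothetical four-letter alphabet with unequal probabilities.
P = {"a1": 0.5, "a2": 0.25, "a3": 0.125, "a4": 0.125}
print(entropy(P))                        # 1.75 bits/letter
print(entropy({s: 0.25 for s in P}))     # 2.0 bits/letter for the equal-probability case
```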
If the assumption of independence does not fit with our observation of the data, we can
generally find better compression schemes if we discard this assumption. When we discard
the independence assumption, we have to come up with a way to describe the dependence of
elements of the data sequence on each other.
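One simple way to capture such dependence, sketched below under the assumption that each
element depends only on the element immediately preceding it, is to estimate conditional
probabilities from the observed data; the sample string is purely illustrative:

```python
from collections import Counter, defaultdict

def conditional_model(sequence):
    """Estimate P(current letter | previous letter) from observed data,
    one way of describing dependence between neighbouring elements."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(sequence, sequence[1:]):
        pair_counts[prev][cur] += 1
    return {prev: {cur: n / sum(counts.values()) for cur, n in counts.items()}
            for prev, counts in pair_counts.items()}

# In English-like text, 'u' is far more likely after 'q' than its overall frequency suggests.
model = conditional_model("the quick brown fox jumps over the lazy dog quite quickly")
print(model.get("q"))   # {'u': 1.0} in this toy sample
```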