Database Reference
In-Depth Information
Theorem 8.6
Let P be a distribution on some finite set
D
, there exists an optimal
prefix code
C
on
D
such that the length of the code for t
∈
D
, denoted by L
(
t
)
is
given by
L
(
t
)
=−
log (
P
(
t
))
.
Moreover, this code is optimal in the sense that it gives the smallest expected code
size for data sets drawn according to P .
(
For the proof, please refer to Theorem 5.4.1
in
[
12
]
.
)
The optimality property means that we introduce no bias using this code length.
The probability distribution induced by a cover function is, of course, given by the
relative usage frequency of each of the patterns.
To determine this, we need to know how often a certain code is used. We define
the
usage
count of a pattern
X
where
X
occurs in its cover. Normalized, this frequency represents the probability that that
code is used in the encoding of an arbitrary
t
∈
CT
as the number of tuples
t
from
D
∈
D
. The optimal code length [
40
]
then is
log of this probability, and a code table is optimal if all its codes have their
optimal length.
More formally, we have the following definition.
−
Definition 8.7
Let
D
be a database drawn from a tuple universe
T
,
C
a prefix code,
X
C
cover a cover function, and CT a code table over
and
. The
usage
count of a
∈
pattern X
CT is defined as
usage
D
(
X
)
=|{
t
∈
D
|
X
∈
cover
(
CT
,
t
)
}|
.
This implies a
probability
distribution over the usage of patterns X
∈
CT in the
cover of
D
by CT , which is given by
usage
D
(
X
)
Y
∈
CT
usage
D
(
Y
)
.
|
D
=
P
(
X
,
CT
)
The code
(
X
|
CT
)
for X
∈
CT is
optimal
for
D
iff
L
(
code
(
X
|
CT
))
=|
code
(
X
|
CT
)
|=−
log (
P
(
X
|
D
,
CT
))
.
A code table CT is
code-optimal
for
D
iff all its codes,
{
code
(
X
|
CT
)
|
X
∈
CT
}
,
are optimal for
.
From now onward we assume that code tables are code-optimal for the database
they are induced on.
D
Example 8.8 Figure
8.1
shows usage counts for all itemsets in the code table. For
example, itemset
is used 5 times in the cover of the database. These usage
counts are used to compute optimal code lengths. For X
{
A
,
B
,
C
}
={
}
A
,
B
,
C
:
5
8
P
(
X
|
D
,
CT
)
=
log
5
8
L
(
code
(
X
|
CT
))
=−
=
0
.
68