Database Reference
In-Depth Information
Theorem 8.6
Let P be a distribution on some finite set
D
, there exists an optimal
prefix code
C
on
D
such that the length of the code for t
D
, denoted by L ( t ) is
given by
L ( t )
=−
log ( P ( t )) .
Moreover, this code is optimal in the sense that it gives the smallest expected code
size for data sets drawn according to P . ( For the proof, please refer to Theorem 5.4.1
in [ 12 ] . )
The optimality property means that we introduce no bias using this code length.
The probability distribution induced by a cover function is, of course, given by the
relative usage frequency of each of the patterns.
To determine this, we need to know how often a certain code is used. We define
the usage count of a pattern X
where X
occurs in its cover. Normalized, this frequency represents the probability that that
code is used in the encoding of an arbitrary t
CT as the number of tuples t from
D
D
. The optimal code length [ 40 ]
then is
log of this probability, and a code table is optimal if all its codes have their
optimal length.
More formally, we have the following definition.
Definition 8.7
Let
D
be a database drawn from a tuple universe
T
,
C
a prefix code,
X
C
cover a cover function, and CT a code table over
and
. The usage count of a
pattern X
CT is defined as
usage D ( X )
=|{
t
D |
X
cover ( CT , t )
}|
.
This implies a probability distribution over the usage of patterns X
CT in the
cover of
D
by CT , which is given by
usage D ( X )
Y CT usage D ( Y ) .
| D
=
P ( X
, CT )
The code ( X
|
CT ) for X
CT is optimal for
D
iff
L ( code ( X | CT ))
=| code ( X | CT )
|=−
log ( P ( X | D
, CT )) .
A code table CT is code-optimal for
D
iff all its codes,
{
code ( X
|
CT )
|
X
CT
}
,
are optimal for
.
From now onward we assume that code tables are code-optimal for the database
they are induced on.
D
Example 8.8 Figure 8.1 shows usage counts for all itemsets in the code table. For
example, itemset
is used 5 times in the cover of the database. These usage
counts are used to compute optimal code lengths. For X
{
A , B , C
}
={
}
A , B , C
:
5
8
P ( X
| D
, CT )
=
log 5
8
L ( code ( X | CT ))
=−
=
0 . 68
 
Search WWH ::




Custom Search