Database Reference
In-Depth Information
Fig. 8.2 Example database,
cover and encoded database
obtained by using the code
table shown in Fig. 8.1 .
I ={ A , B , C }
Database
Cover with CT
Encoded database
AB
C
AB
C
AB
C
AB
C
AB
C
AB
C
AB
C
AB
C
AB
B
C
AB
B
C
A
A
A
A
B
B
By not allowing itemsets to overlap, it is always unambiguous what the cover of a
transaction is. If overlap would be allowed, it can easily happen that multiple covers
are possible and computing and testing all of them would be a computational burden.
To ensure that each code table is 'valid', each CT is required to contain at least
all singleton itemsets from
I
, i.e., PS
⊇{{
i
}|
i
I }
. This way, any transaction
t
.
Figure 8.2 shows an example database consisting of 8 itemsets, of which 5 are
identical. Also shown is the cover of this database with the example code table from
Fig. 8.1 . In this example, each transaction is covered by only a single itemset from the
code table, resulting in very good compression. Obviously it is often not the case that
complete transactions can be covered with a single itemset. For example, if itemset
{
P
(
I
) can always be covered by any CT
CT
had not been in the code table, the first five transactions would have been
covered by
ABC
}
{
AB
}
and
{
C
}
.
To encode a database
D
using code table CT we simply replace each tuple t
D
by the codes of the patterns in the cover of t ,
t →{ code ( X | CT )
| X cover ( CT , t )
} .
Note that to ensure that we can decode an encoded database uniquely we assume
that
C
is a prefix code , in which no code is the prefix of another code [ 12 ].
Example 8.5 Figure 8.2 shows how the cover of a database can be translated into
an encoded database: replace each itemset in the cover by its associated code.
3.2.2
Computing Encoded Lengths
Since MDL is concerned with the best compression, the codes in CT should be cho-
sen such that the most often used code has the shortest length. That is, we should use
optimal prefix codes. As there exists a nice correspondence between code lengths
and probability distributions (see, e.g., [ 40 ]), the optimal code lengths can be cal-
culated through the Shannon entropy. In MDL we are only interested in measuring
complexity, and not in materialized codes. As such we do not have to require round
code lengths, nor do we have to operate an actual prefix coding scheme such as
Shannon-Fano or Huffman encoding.
Search WWH ::




Custom Search