Mining and Using Sets of Patterns through Compression - Frequent Pattern Mining

Database Reference

In-Depth Information

Fig. 8.2 Example database,

cover and encoded database

obtained by using the code

table shown in Fig. 8.1 .

I ={ A , B , C }

Database

Cover with CT

Encoded database

AB

C

AB

C

AB

C

AB

C

AB

C

AB

C

AB

C

AB

C

AB

B

C

AB

B

C

A

B

By not allowing itemsets to overlap, it is always unambiguous what the cover of a

transaction is. If overlap would be allowed, it can easily happen that multiple covers

are possible and computing and testing all of them would be a computational burden.

To ensure that each code table is 'valid', each CT is required to contain at least

all singleton itemsets from

I

, i.e., PS

⊇{{

i

}|

i

∈ I }

. This way, any transaction

t

.

Figure 8.2 shows an example database consisting of 8 itemsets, of which 5 are

identical. Also shown is the cover of this database with the example code table from

Fig. 8.1 . In this example, each transaction is covered by only a single itemset from the

code table, resulting in very good compression. Obviously it is often not the case that

complete transactions can be covered with a single itemset. For example, if itemset

{

∈ P

(

I

) can always be covered by any CT

∈ CT

had not been in the code table, the first five transactions would have been

covered by

ABC

}

{

AB

}

and

{

C

}

.

To encode a database

D

using code table CT we simply replace each tuple t

∈ D

by the codes of the patterns in the cover of t ,

t →{ code ( X | CT )

| X ∈ cover ( CT , t )

} .

Note that to ensure that we can decode an encoded database uniquely we assume

that

C

is a prefix code , in which no code is the prefix of another code [ 12 ].

Example 8.5 Figure 8.2 shows how the cover of a database can be translated into

an encoded database: replace each itemset in the cover by its associated code.

3.2.2

Computing Encoded Lengths

Since MDL is concerned with the best compression, the codes in CT should be cho-

sen such that the most often used code has the shortest length. That is, we should use

optimal prefix codes. As there exists a nice correspondence between code lengths

and probability distributions (see, e.g., [ 40 ]), the optimal code lengths can be cal-

culated through the Shannon entropy. In MDL we are only interested in measuring

complexity, and not in materialized codes. As such we do not have to require round

code lengths, nor do we have to operate an actual prefix coding scheme such as

Shannon-Fano or Huffman encoding.

Frequent Pattern Mining

Search WWH ::

Custom Search

Home