Mining and Using Sets of Patterns through Compression - Frequent Pattern Mining

Database Reference

In-Depth Information

Theorem 8.6

Let P be a distribution on some finite set

, there exists an optimal

prefix code

such that the length of the code for t

∈ D

, denoted by L ( t ) is

given by

L ( t )

=−

log ( P ( t )) .

Moreover, this code is optimal in the sense that it gives the smallest expected code

size for data sets drawn according to P . ( For the proof, please refer to Theorem 5.4.1

in [ 12 ] . )

The optimality property means that we introduce no bias using this code length.

The probability distribution induced by a cover function is, of course, given by the

relative usage frequency of each of the patterns.

To determine this, we need to know how often a certain code is used. We define

the usage count of a pattern X

where X

occurs in its cover. Normalized, this frequency represents the probability that that

code is used in the encoding of an arbitrary t

∈

CT as the number of tuples t from

∈ D

. The optimal code length [ 40 ]

then is

log of this probability, and a code table is optimal if all its codes have their

optimal length.

More formally, we have the following definition.

−

Definition 8.7

Let

be a database drawn from a tuple universe

a prefix code,

cover a cover function, and CT a code table over

and

. The usage count of a

∈

pattern X

CT is defined as

usage D ( X )

=|{

∈ D |

∈

cover ( CT , t )

This implies a probability distribution over the usage of patterns X

∈

CT in the

cover of

by CT , which is given by

usage D ( X )

Y ∈ CT usage D ( Y ) .

| D

P ( X

, CT )

The code ( X

CT ) for X

∈

CT is optimal for

iff

L ( code ( X | CT ))

=| code ( X | CT )

|=−

log ( P ( X | D

, CT )) .

A code table CT is code-optimal for

iff all its codes,

{

code ( X

CT )

∈

}

are optimal for

From now onward we assume that code tables are code-optimal for the database

they are induced on.

Example 8.8 Figure 8.1 shows usage counts for all itemsets in the code table. For

example, itemset

is used 5 times in the cover of the database. These usage

counts are used to compute optimal code lengths. For X

{

A , B , C

}

A , B , C

P ( X

| D

, CT )

log 5

L ( code ( X | CT ))

=−

0 . 68

Frequent Pattern Mining

Search WWH ::

Custom Search

Home