Mining and Using Sets of Patterns through Compression - Frequent Pattern Mining

Database Reference

In-Depth Information

Fig. 8.1 Example code table.

The widths of the codes

represent their lengths.

I ={ A , B , C }

Code table CT

Itemset

Usage

Code

. Note that the

usage column is not part of

the code table, but shown here

as illustration: for optimal

compression, codes should be

shorter the more often they

are used

A

B

C

5

A

1

A

1

B

1

C

0

Next, we describe both the general approach, as well as cover a specific instance

for transaction data. First, we formally define a code table.

X

C

Definition 8.1

Let

be a set of patterns and

a set of code words. A code table

X

C

CT over

is a two-column table such that:

1. The first column contains patterns, that is, elements from

and

X

.

2. The second column contains elements from

C

, such that each element of

C

occurs

at most once.

We write code ( X | CT ) for the code corresponding to a pattern X ∈ CT . Further,

we say PS for

{

X

∈

CT

}

, the pattern set of CT .

Example 8.2 Throughout we will use KRIMP [ 57 , 68 ] as a running example. It was

the first pattern set mining method using code tables and MDL, and considers itemset

data. In all examples, a dataset

D

is a bag of transactions over a set of items

I

, i.e.,

for each t

. Patterns are also itemsets and the pattern language

is the set of all possible itemsets, i.e.,

∈ D

we have t

⊆ I

2 I ={

.

Figure 8.1 shows an example KRIMP code table of five patterns. The left column

lists the itemsets, the second column contains the codes. Each bar represents a code,

its width represents the code length. (Note, these are obviously not real codes, but

a simplified representation; for our purposes representing code lengths suffices.)

The usage column is not part of the code table, but only used to determine the code

lengths.

X =

X

⊆ I }

3.2.1

Encoding the Data

D

Given a dataset

with

CT . As already mentioned, encoding a dataset is done by replacing occurrences of

patterns in the code table by their corresponding codes. To achieve lossless encoding

of the data, we need to cover the complete dataset with patterns from pattern set PS .

In practice, covering a dataset is usually done on a per-tuple basis, such that each

tuple is covered by a subset of the patterns in the code table. Hence, a code table

normally has all three properties discussed in the previous subsection: it allows for

dataset-level compression , tuple-level compression , and sub-tuple inspection .

and a code table CT , we need to define how to encode

Frequent Pattern Mining

Search WWH ::

Custom Search

Home