Database Reference
In-Depth Information
Fig. 8.1 Example code table.
The widths of the codes
represent their lengths.
I ={ A , B , C }
Code table CT
Itemset
Usage
Code
. Note that the
usage column is not part of
the code table, but shown here
as illustration: for optimal
compression, codes should be
shorter the more often they
are used
A
B
B
C
5
A
1
A
1
B
1
C
0
Next, we describe both the general approach, as well as cover a specific instance
for transaction data. First, we formally define a code table.
X
C
Definition 8.1
Let
be a set of patterns and
a set of code words. A code table
X
C
CT over
is a two-column table such that:
1. The first column contains patterns, that is, elements from
and
X
.
2. The second column contains elements from
C
, such that each element of
C
occurs
at most once.
We write code ( X | CT ) for the code corresponding to a pattern X CT . Further,
we say PS for
{
X
CT
}
, the pattern set of CT .
Example 8.2 Throughout we will use KRIMP [ 57 , 68 ] as a running example. It was
the first pattern set mining method using code tables and MDL, and considers itemset
data. In all examples, a dataset
D
is a bag of transactions over a set of items
I
, i.e.,
for each t
. Patterns are also itemsets and the pattern language
is the set of all possible itemsets, i.e.,
D
we have t
I
2 I ={
.
Figure 8.1 shows an example KRIMP code table of five patterns. The left column
lists the itemsets, the second column contains the codes. Each bar represents a code,
its width represents the code length. (Note, these are obviously not real codes, but
a simplified representation; for our purposes representing code lengths suffices.)
The usage column is not part of the code table, but only used to determine the code
lengths.
X =
X
I }
3.2.1
Encoding the Data
D
D
Given a dataset
with
CT . As already mentioned, encoding a dataset is done by replacing occurrences of
patterns in the code table by their corresponding codes. To achieve lossless encoding
of the data, we need to cover the complete dataset with patterns from pattern set PS .
In practice, covering a dataset is usually done on a per-tuple basis, such that each
tuple is covered by a subset of the patterns in the code table. Hence, a code table
normally has all three properties discussed in the previous subsection: it allows for
dataset-level compression , tuple-level compression , and sub-tuple inspection .
and a code table CT , we need to define how to encode
 
Search WWH ::




Custom Search