Table 2. Datasets

Dataset          Sequences   Sequence length              Items     Classes
                             Min     Max       Avg
Reuters-21578    6,454       4       371       52.03      27,600    10
NewsGroup        2,000       83      21,691    303.97     41,420    20
DNA              2,000       60      60        60         4         3
We ran experiments with different support threshold values (denoted minsup) and for different maximum gap values (denoted maxgap). Experiments were run on an Intel P4 with 2.8 GHz CPU clock rate and 2 GB RAM. The CompactForm Miner algorithm has been implemented in ANSI C.
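To make the roles of the two parameters concrete, the following Python sketch counts the support of a candidate sequence under a gap constraint. It is only an illustration and not the CompactForm Miner implementation. It assumes that a sequence is a list of items, that maxgap bounds the distance between the positions of two consecutive matched items (so maxgap = 1 means contiguous items), and that support is the fraction of dataset sequences containing the candidate. The names occurs_with_maxgap and support and the toy dataset are hypothetical.

```python
from typing import Hashable, Sequence


def occurs_with_maxgap(pattern: Sequence[Hashable],
                       sequence: Sequence[Hashable],
                       maxgap: int) -> bool:
    """Return True if `pattern` occurs in `sequence` with consecutive
    matched positions at most `maxgap` apart (assumed gap definition)."""
    if not pattern:
        return True
    # Positions at which a match of the current pattern prefix can end.
    ends = {i for i, item in enumerate(sequence) if item == pattern[0]}
    for symbol in pattern[1:]:
        ends = {
            j
            for i in ends
            for j in range(i + 1, min(i + maxgap, len(sequence) - 1) + 1)
            if sequence[j] == symbol
        }
        if not ends:
            return False
    return bool(ends)


def support(pattern, dataset, maxgap: int) -> float:
    """Fraction of sequences in `dataset` that contain `pattern`."""
    hits = sum(occurs_with_maxgap(pattern, seq, maxgap) for seq in dataset)
    return hits / len(dataset)


# Toy dataset, invented for illustration only.
toy_dataset = [list("abc"), list("axbxc"), list("acb")]
minsup = 0.5  # hypothetical support threshold

for gap in (1, 2, 4):
    s = support(["a", "c"], toy_dataset, gap)
    print(f"maxgap={gap}: support={s:.2f}, frequent={s >= minsup}")
```

Under these assumptions, relaxing maxgap can only keep or increase the support of a candidate, so more sequences pass a fixed minsup threshold as maxgap grows.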
7.1 Compression Factor
Let R be the set of all rules which satisfy both minsup and maxgap constraints and CRC and CCRS the sets of general rules and compact rules satisfying the same constraints. To measure the compression factor achieved by our compact representations, we compare their size with the size of the complete rule set. The compression factor (CF%) for the two representations is respectively $(1 - |CRC|/|R|)\%$ and $(1 - |CCRS|/|R|)\%$. For the CRC representation, a high compression factor indicates that rules whose antecedent is a generator sequence are a small fraction of R. Instead, for the CCRS representation, a high compression factor indicates that rules whose antecedent is a closed sequence are a small fraction of R. In both cases, a small subset of R encodes all useful information to model classes.
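As a worked illustration of the compression factor, the short Python sketch below evaluates CF% from rule-set sizes. It assumes that the percentage is obtained by multiplying the ratio expression by 100 and that a compact representation is never larger than R. The function name compression_factor and the rule counts are hypothetical and are not taken from the experiments.

```python
def compression_factor(compact_size: int, full_size: int) -> float:
    """Return CF% = (1 - compact_size / full_size) * 100.

    compact_size stands for |CRC| or |CCRS|, full_size for |R|.
    """
    if full_size <= 0:
        raise ValueError("the complete rule set R must be non-empty")
    if not 0 <= compact_size <= full_size:
        # Assumption: the compact representation is a subset of R.
        raise ValueError("a compact representation cannot be larger than R")
    return (1.0 - compact_size / full_size) * 100.0


# Hypothetical rule-set sizes, chosen only for illustration.
size_r, size_crc, size_ccrs = 10_000, 400, 650

cf_crc = compression_factor(size_crc, size_r)
cf_ccrs = compression_factor(size_ccrs, size_r)
print(f"CF%(CRC)  = {cf_crc:.2f}")
print(f"CF%(CCRS) = {cf_ccrs:.2f}")
```

With these made-up sizes the smaller representation yields the higher compression factor, which is the sense in which the two representations are compared in this section.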
Different data distributions yield a different behavior when varying minsup and maxgap values. In the following we summarize some common behaviors. Then, we analyze each dataset separately and discuss it in detail.
For moderately high minsup values, the two representations have a very close size (or even exactly the same size). In this case, the subsets of rules in R having as antecedent a closed sequence or a generator sequence are almost the same.
When lowering the support threshold or increasing the maxgap value, the number of rules in set R and in sets CCRS and CRC increases significantly. In this case, the CRC representation often achieves a higher compression than the CCRS representation. This effect occurs for maxgap > 1 and low minsup values. In this case, the set of rules with a generator sequence as antecedent is smaller than the set of rules with a closed sequence as antecedent. This occurs because when increasing maxgap or decreasing minsup, mined sequences are characterized by increasing length. Hence, the number of closed sequences, which are the sequences with the longest antecedent, increases significantly. Instead, the increase in the number of generator sequences, which have shorter