Table 2. Datasets

Dataset         Sequences   Sequence length           Items    Classes
                            Min    Max      Avg
Reuters-21578   6,454       4      371      52.03     27,600   10
NewsGroup       2,000       83     21,691   303.97    41,420   20
DNA             2,000       60     60       60        4        3
We ran experiments with different support threshold values (denoted
minsup) and different maximum gap values (denoted maxgap). Experiments
were run on an Intel P4 with a 2.8 GHz CPU clock rate and 2 GB of RAM.
The CompactForm Miner algorithm was implemented in ANSI C.
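To make the roles of the two parameters concrete, the following is a minimal sketch of how the support of a sequence could be counted under a gap constraint. It is not the CompactForm Miner implementation. It assumes that maxgap bounds the distance between the positions of consecutive matched items (so maxgap = 1 means contiguous items) and that support is the fraction of database sequences containing the pattern; both are assumptions, and the function names `occurs_with_maxgap` and `support` are hypothetical.

```python
def occurs_with_maxgap(pattern, sequence, maxgap):
    """Return True if `pattern` occurs in `sequence` such that consecutive
    matched positions are at most `maxgap` apart (assumed gap definition)."""
    if not pattern:
        return True

    def extend(pos, k):
        # `pos` is the position where pattern[k - 1] was matched.
        if k == len(pattern):
            return True
        last = min(pos + maxgap, len(sequence) - 1)
        for nxt in range(pos + 1, last + 1):
            if sequence[nxt] == pattern[k] and extend(nxt, k + 1):
                return True
        return False

    # Try every possible starting position for the first pattern item.
    return any(
        sequence[i] == pattern[0] and extend(i, 1)
        for i in range(len(sequence))
    )


def support(pattern, database, maxgap):
    """Fraction of sequences in `database` that contain `pattern`
    under the gap constraint (relative support is an assumption)."""
    hits = sum(occurs_with_maxgap(pattern, seq, maxgap) for seq in database)
    return hits / len(database)


# Toy database of three sequences (illustrative data, not from Table 2).
toy_db = ["abcd", "axbd", "abxxd"]
frequent_at_gap1 = support("ab", toy_db, maxgap=1) >= 0.5
```

Under these assumptions, a pattern is retained when its support is at least minsup. Raising maxgap can only add matches, so support never decreases as maxgap grows.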
7.1 Compression Factor
Let $\mathcal{R}$ be the set of all rules which satisfy both minsup and maxgap
constraints and CRC and CCRS the set of general rules and compact rules
satisfying the same constraints. To measure the compression factor achieved
by our compact representations, we compare their size with the size of the
complete rule set. The compression factor (CF%) for the two representations
is respectively $\left(1 - \frac{|CRC|}{|\mathcal{R}|}\right)\%$ and
$\left(1 - \frac{|CCRS|}{|\mathcal{R}|}\right)\%$.
For the CRC representation, a high compression factor indicates that rules
whose antecedent is a generator sequence are a small fraction of $\mathcal{R}$.
Instead, for the CCRS representation, a high compression factor indicates that
rules whose antecedent is a closed sequence are a small fraction of
$\mathcal{R}$. In both cases, a small subset of $\mathcal{R}$ encodes all
useful information to model classes.
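As a worked illustration of the compression factor, the sketch below computes CF% from the sizes of a compact representation and of the complete rule set. The function name `compression_factor` and the rule counts are hypothetical, not results from the experiments. The explicit multiplication by 100 is an assumption about how the printed formula is meant to be read as a percentage.

```python
def compression_factor(n_compact: int, n_all: int) -> float:
    """Return CF% = (1 - n_compact / n_all) * 100.

    n_compact: number of rules in the compact representation (CRC or CCRS)
    n_all:     number of rules in the complete rule set
    """
    if n_all <= 0:
        raise ValueError("the complete rule set must be non-empty")
    if not 0 <= n_compact <= n_all:
        # The compact representation is taken to be a subset of the full set.
        raise ValueError("compact size must lie between 0 and n_all")
    return (1.0 - n_compact / n_all) * 100.0


# Hypothetical sizes, for illustration only.
cf_example = compression_factor(n_compact=250, n_all=1000)  # 75.0
```

A value close to 100 means the compact representation keeps only a small fraction of the rules. A value of 0 means it is as large as the complete rule set.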
Different data distributions behave differently as the minsup and maxgap
values vary. In the following, we first summarize some common behaviors.
Then, we analyze each dataset separately and discuss it in detail.
For moderately high minsup values, the two representations have nearly
the same size (or even exactly the same size). In this case, the subsets of
rules in the complete rule set $\mathcal{R}$ having as antecedent a closed
sequence or a generator sequence are almost the same.
When lowering the support threshold or increasing the maxgap value, the
number of rules in the complete rule set $\mathcal{R}$ and in sets CCRS and
CRC increases significantly. In this case, the CRC representation often
achieves a higher compression than the CCRS representation. This effect
occurs for maxgap > 1 and low minsup values, where the set of rules with a
generator sequence as antecedent is smaller than the set of rules with a
closed sequence as antecedent. The reason is that, as maxgap increases or
minsup decreases, mined sequences become longer. Hence, the number of closed
sequences, which are the sequences with the longest antecedent, increases
significantly. Instead, the increase in the number of generator sequences,
which have shorter