Compact Representations of Sequential Classification Rules - Data Mining: Foundations and Practice

Table 2. Datasets

Dataset         Sequences   Sequence length             Items    Classes
                            Min    Max      Avg
Reuters-21578   6,454       4      371      52.03       27,600   10
NewsGroup       2,000       83     21,691   303.97      41,420   20
DNA             2,000       60 (single value given)     4        3

We ran experiments with different support threshold values (denoted minsup) and different maximum gap values (denoted maxgap). Experiments were run on an Intel P4 with a 2.8 GHz CPU clock rate and 2 GB of RAM. The CompactForm Miner algorithm was implemented in ANSI C.

7.1 Compression Factor

Let $\mathcal{R}$ be the set of all rules which satisfy both the minsup and maxgap constraints, and let CRC and CCRS be the sets of general rules and compact rules satisfying the same constraints. To measure the compression factor achieved by our compact representations, we compare their size with the size of the complete rule set. The compression factor (CF%) for the two representations is respectively $(1 - |CRC| / |\mathcal{R}|)\%$ and $(1 - |CCRS| / |\mathcal{R}|)\%$.

For the CRC representation, a high compression factor indicates that rules whose antecedent is a generator sequence are a small fraction of $\mathcal{R}$. Instead, for the CCRS representation, a high compression factor indicates that rules whose antecedent is a closed sequence are a small fraction of $\mathcal{R}$. In both cases, a small subset of $\mathcal{R}$ encodes all useful information to model classes.
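As a minimal sketch of the compression factor defined above, the following Python snippet computes CF% from rule-set sizes. It assumes the formula is read as a percentage, that is 100 · (1 − |X| / |R|), with X being either CRC or CCRS. The function name `compression_factor` and all rule counts are hypothetical and chosen only for illustration; they are not results from the experiments.

```python
def compression_factor(total_rules: int, compact_rules: int) -> float:
    """Return CF% = 100 * (1 - compact_rules / total_rules).

    total_rules   -- size of the complete rule set R
    compact_rules -- size of the compact representation (CRC or CCRS)
    """
    if total_rules <= 0:
        raise ValueError("total_rules must be positive")
    if not 0 <= compact_rules <= total_rules:
        # A compact representation is a subset of the complete rule set.
        raise ValueError("compact_rules must lie between 0 and total_rules")
    return 100.0 * (total_rules - compact_rules) / total_rules


# Hypothetical rule counts, NOT taken from the chapter's experiments.
r_size = 10_000    # |R|: all rules satisfying minsup and maxgap
crc_size = 1_500   # |CRC|: rules with a generator sequence as antecedent
ccrs_size = 2_500  # |CCRS|: rules with a closed sequence as antecedent

cf_crc = compression_factor(r_size, crc_size)
cf_ccrs = compression_factor(r_size, ccrs_size)

print(f"CF%(CRC)  = {cf_crc:.1f}")
print(f"CF%(CCRS) = {cf_ccrs:.1f}")
```

With these made-up counts, the CRC representation would have the higher compression factor because it retains fewer rules of the complete set.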

Different data distributions yield different behaviors when varying the minsup and maxgap values. In the following, we first summarize some common behaviors. Then, we analyze each dataset separately and discuss it in detail.

For moderately high minsup values, the two representations have a very close size (or even exactly the same size). In this case, the subsets of rules in R having as antecedent a closed sequence or a generator sequence are almost the same.

When lowering the support threshold or increasing the maxgap value, the number of rules in set R and in sets CCRS and CRC increases significantly. In this case, the CRC representation often achieves a higher compression than the CCRS representation. This effect occurs for maxgap > 1 and low minsup values, where the set of rules with a generator sequence as antecedent is smaller than the set of rules with a closed sequence as antecedent. This occurs because, when increasing maxgap or decreasing minsup, mined sequences are characterized by increasing length. Hence, the number of closed sequences, which are the sequences with the longest antecedent, increases significantly. Instead, the increase in the number of generator sequences, which have shorter
