Compact Representations of Sequential Classification Rules - Data Mining: Foundations and Practice

Databases Reference

In-Depth Information

order to associate each generator sequence to the appropriate closed sequence.

For example, in the Reuters dataset when minsup =0 . 1% and maxgap goes

from 3 to 5, CPU times range from 22 min up to 1.80 h. In the Newsgroup

dataset, CPU time is always lower than 1 min. However, when we consider

maxgap > 4 and supports lower than 2.5% more than 10 h are required.

We compared our time performance with the algorithm proposed in [17], an

e cient state-of-the-art algorithm for constrained sequence mining. Zaki [17]

does not address the extraction of closed and generator sequences. The code

was downloaded from [1]. To optimize memory usage, [17] partitions the search

space and analyzes each partition independently. The same approach can not

be applied in our context due to the type of search we want to perform. When

the required memory is not very large, the two algorithms provide comparable

performance. Otherwise, [17] yields better performance. For example, when

large amounts of memory are required, [17] runs up to five times faster for the

Reuters dataset.

8 Related Work

Sequential pattern mining is a relevant research area with applications in a va-

riety of different contexts. Examples of sequential data include text data, DNA

sequence, web log files, and customer purchase transactions. The problem of

mining sequential patterns has been introduced in [4]. It has been further

studied in [20, 24, 27, 35], where different sequence extraction algorithms have

been proposed.

Several types of user-defined constraints on sequential pattern have also

been considered. For example, minimum or maximum gap constraint between

consecutive events or between the first and the last event in the sequence

have been addressed in [5, 17, 18, 21, 25]. Alternative constraints for sequence

mining are relative to sequence length, occurrence of a set of items within the

sequence, or regular expression constraints over a sequence [16, 25].

For classification datasets, where each input sequence is characterized by a

class label, sequential classification rules have been introduced in [17]. These

rules allow modeling class properties and have been exploited for instance in

the web context to predict users' next requests and behavior (e.g., to predict

the next request for a web document) [32], or in the biological domain for

example to predict protein properties [26].

When low support thresholds are considered or the dataset is highly cor-

related, a huge set of sequences may be generated, until the problem becomes

intractable. To deal with the generation of a large solution set, in the context

of association rule mining a significant effort has been devoted to define con-

cise representations for frequent itemsets and association rules. For frequent

itemsets these forms are based on the concepts of maximal itemsets [10], closed

itemsets [23,34], free sets [12], disjunction-free generators [13], and deduction

Search WWH ::

Custom Search

Home