Databases Reference
In-Depth Information
order to associate each generator sequence to the appropriate closed sequence.
For example, in the Reuters dataset when minsup =0 . 1% and maxgap goes
from 3 to 5, CPU times range from 22 min up to 1.80 h. In the Newsgroup
dataset, CPU time is always lower than 1 min. However, when we consider
maxgap > 4 and supports lower than 2.5% more than 10 h are required.
We compared our time performance with the algorithm proposed in [17], an
e cient state-of-the-art algorithm for constrained sequence mining. Zaki [17]
does not address the extraction of closed and generator sequences. The code
was downloaded from [1]. To optimize memory usage, [17] partitions the search
space and analyzes each partition independently. The same approach can not
be applied in our context due to the type of search we want to perform. When
the required memory is not very large, the two algorithms provide comparable
performance. Otherwise, [17] yields better performance. For example, when
large amounts of memory are required, [17] runs up to five times faster for the
Reuters dataset.
8 Related Work
Sequential pattern mining is a relevant research area with applications in a va-
riety of different contexts. Examples of sequential data include text data, DNA
sequence, web log files, and customer purchase transactions. The problem of
mining sequential patterns has been introduced in [4]. It has been further
studied in [20, 24, 27, 35], where different sequence extraction algorithms have
been proposed.
Several types of user-defined constraints on sequential pattern have also
been considered. For example, minimum or maximum gap constraint between
consecutive events or between the first and the last event in the sequence
have been addressed in [5, 17, 18, 21, 25]. Alternative constraints for sequence
mining are relative to sequence length, occurrence of a set of items within the
sequence, or regular expression constraints over a sequence [16, 25].
For classification datasets, where each input sequence is characterized by a
class label, sequential classification rules have been introduced in [17]. These
rules allow modeling class properties and have been exploited for instance in
the web context to predict users' next requests and behavior (e.g., to predict
the next request for a web document) [32], or in the biological domain for
example to predict protein properties [26].
When low support thresholds are considered or the dataset is highly cor-
related, a huge set of sequences may be generated, until the problem becomes
intractable. To deal with the generation of a large solution set, in the context
of association rule mining a significant effort has been devoted to define con-
cise representations for frequent itemsets and association rules. For frequent
itemsets these forms are based on the concepts of maximal itemsets [10], closed
itemsets [23,34], free sets [12], disjunction-free generators [13], and deduction
Search WWH ::




Custom Search