Databases Reference
In-Depth Information
In large or highly correlated datasets, rule extraction algorithms have to
deal with the combinatorial explosion of the solution space. To cope with this
problem, pruning of the generated rule set based on some quality indexes (e.g.,
confidence, support, and χ 2 ) is usually performed. In this way rules which are
redundant from a functional point of view [11, 19] are discarded. A different
approach consists in generating equivalent representations [7] that are more
compact, without information loss.
In this chapter we propose two compact forms to represent sets of sequen-
tial classification rules. The first compact form is based on the concept of
generator sequence, which is an extension to sequential patterns of the con-
cept of generator itemset [23]. Based on generator sequences, we define general
sequential rules. The collection of all general sequential rules extracted from a
dataset represents a sequential classification rule cover. A rule cover encodes
all useful classification information in a sequential rule set (i.e., is equivalent
to it for classification purposes). However, it does not allow the regeneration
of the complete rule set.
The second proposed compact form exploits jointly the concepts of closed
sequence and generator sequence. While the notion of generator sequence, to
our knowledge, is new, closed sequences have been introduced in [29,31]. Based
on closed sequences, we define closed sequential rules. A closed sequential rule
is the most specialistic (i.e., characterized by the longest sequence) rule into
a set of equivalent rules. To allow regeneration of the complete rule set, in the
compact form each closed sequential rule is associated to the complete set of
its generator sequences.
To characterize our compact representations, we first define a general
framework for sequential rule mining under different types of constraints. Con-
strained sequence mining addresses the extraction of sequences which satisfy
some user defined-constraints. Example of constraints are minimum or maxi-
mum gap between events [5,17,18,21,25], sequence length or regular expression
constraints over a sequence [16, 25]. We characterize the two compact forms
within this general framework.
We then define a specialization of the proposed framework which addresses
the maximum gap constraint between consecutive events in a sequence. This
constraint is particularly interesting in domains where there is high correlation
between neighboring elements, but correlation rapidly decreases with distance.
Examples are the biological application domain (e.g., the analysis of DNA
sequences), text analysis, web mining. In this context, we present an algorithm
for mining our compact representations.
The chapter is organized as follows. Section 2 introduces the basic con-
cepts and notation for the sequential rule mining task, while Sect. 3 presents
our framework for sequential rule mining. Sections 4 and 5 describe the com-
pact forms for sequences and for sequential rules, respectively. In Sect. 6 the
algorithm for mining our compact representations is presented, while Sect. 7
reports experimental result on the compression effectiveness of the proposed
techniques. Section 8 discusses previous related work. Finally, Sect. 9 draws
some conclusions and outlines future work.
Search WWH ::




Custom Search