Compact Representations of Sequential Classification Rules - Data Mining: Foundations and Practice

Databases Reference

In-Depth Information

In large or highly correlated datasets, rule extraction algorithms have to

deal with the combinatorial explosion of the solution space. To cope with this

problem, pruning of the generated rule set based on some quality indexes (e.g.,

confidence, support, and χ 2 ) is usually performed. In this way rules which are

redundant from a functional point of view [11, 19] are discarded. A different

approach consists in generating equivalent representations [7] that are more

compact, without information loss.

In this chapter we propose two compact forms to represent sets of sequen-

tial classification rules. The first compact form is based on the concept of

generator sequence, which is an extension to sequential patterns of the con-

cept of generator itemset [23]. Based on generator sequences, we define general

sequential rules. The collection of all general sequential rules extracted from a

dataset represents a sequential classification rule cover. A rule cover encodes

all useful classification information in a sequential rule set (i.e., is equivalent

to it for classification purposes). However, it does not allow the regeneration

of the complete rule set.

The second proposed compact form exploits jointly the concepts of closed

sequence and generator sequence. While the notion of generator sequence, to

our knowledge, is new, closed sequences have been introduced in [29,31]. Based

on closed sequences, we define closed sequential rules. A closed sequential rule

is the most specialistic (i.e., characterized by the longest sequence) rule into

a set of equivalent rules. To allow regeneration of the complete rule set, in the

compact form each closed sequential rule is associated to the complete set of

its generator sequences.

To characterize our compact representations, we first define a general

framework for sequential rule mining under different types of constraints. Con-

strained sequence mining addresses the extraction of sequences which satisfy

some user defined-constraints. Example of constraints are minimum or maxi-

mum gap between events [5,17,18,21,25], sequence length or regular expression

constraints over a sequence [16, 25]. We characterize the two compact forms

within this general framework.

We then define a specialization of the proposed framework which addresses

the maximum gap constraint between consecutive events in a sequence. This

constraint is particularly interesting in domains where there is high correlation

between neighboring elements, but correlation rapidly decreases with distance.

Examples are the biological application domain (e.g., the analysis of DNA

sequences), text analysis, web mining. In this context, we present an algorithm

for mining our compact representations.

The chapter is organized as follows. Section 2 introduces the basic con-

cepts and notation for the sequential rule mining task, while Sect. 3 presents

our framework for sequential rule mining. Sections 4 and 5 describe the com-

pact forms for sequences and for sequential rules, respectively. In Sect. 6 the

algorithm for mining our compact representations is presented, while Sect. 7

reports experimental result on the compression effectiveness of the proposed

techniques. Section 8 discusses previous related work. Finally, Sect. 9 draws

some conclusions and outlines future work.

Data Mining: Foundations and Practice

Search WWH ::

Custom Search

Home