From the properties of closed itemsets, it follows that a rule set containing
only specialistic rules is a compact and lossless representation of R only
when anti-monotonic constraints (e.g., the support constraint) are applied. This
property is lost in the case of non-anti-monotonic constraints (e.g., the confidence
constraint). In the CCRS representation, each compact rule contains all the in-
formation needed to generate all the rules encoded in it, independently of
the other rules in the set. Hence, it is always possible to regenerate set R
starting from the CCRS rule set.
6 Mining Compact Representations
In this section we present an algorithm to extract the compact rule set and
the classification rule cover representations from a sequence dataset. The al-
gorithm works in a specific instance of our framework for sequential rule min-
ing. Recall that in our framework sequence mining is constrained by the pair
(Ψ, Φ). The set of matching functions Ψ defines the containment between a
sequence and an input-sequence. In the considered framework instance, func-
tions in Ψ yield a contiguous subsequence relation. Hence, the mined compact
representations yield contiguous closed sequences and contiguous generator
sequences. In this section, we will denote the mined sequences simply as gen-
erator or closed sequences since the contiguity constraint is assumed. Set Φ
contains all matching functions which satisfy the maximum gap constraint.
Hence, the gap-constrained subsequence relation X ⊑_Φ S (where X is a se-
quence and S an input-sequence) can be formalized as X ⊑ S with X satisfying
the maximum gap constraint in S. Furthermore, for readability, we omit set Φ
when denoting sequence support, rule support, and rule confidence.
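The two containment notions above can be sketched as follows. This is a minimal illustration only: the list-based encodings, the event-identifier array, and the function names are assumptions for the sketch, not the chapter's actual definitions.

```python
def contiguous_contains(x, s):
    """Contiguous subsequence relation induced by the functions in Psi:
    x must occur in input-sequence s as an unbroken run of events."""
    n = len(x)
    return any(s[i:i + n] == x for i in range(len(s) - n + 1))

def gap_contains(x, s, eids, max_gap):
    """Gap-constrained relation induced by Phi: x is a (not necessarily
    contiguous) subsequence of s, and each pair of consecutively matched
    events is at most max_gap event identifiers apart."""
    def search(xi, start, prev_eid):
        if xi == len(x):
            return True  # all items of x have been matched in order
        for j in range(start, len(s)):
            gap_ok = prev_eid is None or eids[j] - prev_eid <= max_gap
            if s[j] == x[xi] and gap_ok:
                if search(xi + 1, j + 1, eids[j]):
                    return True
        return False
    return search(0, 0, None)
```

For example, with s = ['c', 'a', 'b', 'd'] and unit-spaced event identifiers, ['a', 'b'] is a contiguous subsequence of s, while ['a', 'd'] is contained in s only under the gap-constrained relation, and only when the maximum gap is at least 2.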
The proposed algorithm is levelwise [5] and computes the set of closed
and generator sequences by increasing length. At each iteration, say itera-
tion k, the algorithm performs the following operations. (1) Starting from set
M_k of k-sequences, it generates set M_{k+1} of (k+1)-sequences. Then, (2) it
prunes from M_{k+1} the sequences encoding only infrequent classification rules.
This pruning method limits the number of iterations and avoids the genera-
tion of uninteresting (i.e., infrequent) rules. (3) The algorithm checks M_{k+1}
against M_k to identify the subset of closed sequences in M_k and the subset
of generator sequences in M_{k+1}. (4) Based on this knowledge, the algorithm
updates the CRC and CCRS sets.
Each sequence is provided with the necessary information to support the next
iteration of the algorithm and to compute the compact representations poten-
tially encoded by it. The following information is associated with a sequence X.
(a) A sequence identifier list (denoted id-list) recording the input-sequences
including X. The id-list is a set of triplets (SID, eid, Class), where SID is
the input-sequence identifier, eid is the event identifier for the first¹ item of
¹ As discussed afterwards, knowledge about the event identifiers for the other items
in X is not necessary.
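The id-list bookkeeping can be sketched as follows. The triplet layout follows the text; the function names are hypothetical, and the support definition (number of distinct supporting input-sequences) is a standard assumption rather than a quotation from the chapter.

```python
from collections import namedtuple

# Triplet layout from the text: input-sequence id, event id of the first
# matched item, and the input-sequence's class label.
IdListEntry = namedtuple("IdListEntry", ["SID", "eid", "Class"])

def support(id_list):
    """Assumed definition: number of distinct input-sequences including X."""
    return len({entry.SID for entry in id_list})

def class_supports(id_list):
    """Distinct supporting input-sequences per class label: the counts
    needed to score the classification rules X -> c encoded by X."""
    by_class = {}
    for entry in id_list:
        by_class.setdefault(entry.Class, set()).add(entry.SID)
    return {c: len(sids) for c, sids in by_class.items()}

# Example: X occurs in input-sequences 1 (at two positions), 2, and 3.
idl = [IdListEntry(1, 4, "pos"), IdListEntry(1, 9, "pos"),
       IdListEntry(2, 2, "pos"), IdListEntry(3, 5, "neg")]
```

Under this encoding, support(idl) counts input-sequence 1 only once despite its two occurrences, which is consistent with treating the id-list as a set of supporting input-sequences.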