Compact Representations of Sequential Classification Rules - Data Mining: Foundations and Practice

(a) CCRS set: compact rules

rule                        sup%     conf%
({A}, A) → c_1              66.66    66.66
({A}, A) → c_2              33.33    33.33
({B}, B) → c_1              33.33    50.00
({B}, B) → c_2              33.33    50.00
({E}, AE) → c_1             33.33   100.00
({AB, E}, ABE) → c_1        33.33   100.00
({C}, ACA) → c_1            33.33    50.00
({C}, ACA) → c_2            33.33    50.00
({DA}, ADA) → c_1           33.33   100.00
({CB, BA}, ACBA) → c_2      33.33   100.00
({DB, BA}, ADBA) → c_2      33.33   100.00
({D, C}, ADCA) → c_1        33.33    50.00
({D, C}, ADCA) → c_2        33.33    50.00
({CB}, ADCBA) → c_2         33.33   100.00

(b) CRC set: general rules

rule          sup%     conf%
A → c_1       66.66    66.66
A → c_2       33.33    33.33
B → c_1       33.33    50.00
B → c_2       33.33    50.00
C → c_1       33.33    50.00
C → c_2       33.33    50.00
D → c_1       33.33    50.00
D → c_2       33.33    50.00
E → c_1       33.33   100.00
AB → c_1      33.33   100.00
BA → c_2      33.33   100.00
CB → c_2      33.33   100.00
DA → c_1      33.33   100.00
DB → c_2      33.33   100.00

Fig. 4. Compact representations
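The sup% and conf% columns of Fig. 4 can be recomputed with a few lines of Python. The sketch below is illustrative only: the three-sequence dataset and the gap constraint (at most one item skipped between consecutive matched items) are assumptions chosen to agree with the figure, since the example dataset and the matching constraint are not shown in this section. The names DATASET, MAX_GAP, occurs and support_and_confidence are hypothetical. Support is taken as the fraction of all sequences that match the pattern and carry the class label, and confidence as that count divided by the number of sequences matching the pattern.

```python
from fractions import Fraction

# ASSUMPTION: toy dataset of (sequence, class label) pairs, chosen to agree
# with Fig. 4. The chapter's actual example dataset is not shown here.
DATASET = [("ADCA", "c1"), ("ABE", "c1"), ("ADCBA", "c2")]

# ASSUMPTION: at most one item may be skipped between consecutive matches.
MAX_GAP = 1


def occurs(pattern, sequence, max_gap=MAX_GAP):
    """Return True if pattern matches sequence under the gap constraint."""

    def extend(p_idx, last_pos):
        if p_idx == len(pattern):
            return True
        if last_pos is None:
            # The first item of the pattern may match anywhere.
            candidates = range(len(sequence))
        else:
            # Later items must follow within max_gap skipped positions.
            stop = min(len(sequence), last_pos + max_gap + 2)
            candidates = range(last_pos + 1, stop)
        return any(
            sequence[pos] == pattern[p_idx] and extend(p_idx + 1, pos)
            for pos in candidates
        )

    return extend(0, None)


def support_and_confidence(pattern, label, dataset=DATASET):
    """Return (support, confidence) of the rule pattern -> label as Fractions."""
    matched_labels = [cls for seq, cls in dataset if occurs(pattern, seq)]
    hits = matched_labels.count(label)
    support = Fraction(hits, len(dataset))
    confidence = Fraction(hits, len(matched_labels)) if matched_labels else Fraction(0)
    return support, confidence


for pattern, label in [("A", "c1"), ("B", "c2"), ("AB", "c1"), ("DA", "c1")]:
    sup, conf = support_and_confidence(pattern, label)
    print(f"{pattern} -> {label}: sup={float(sup):.2%} conf={float(conf):.2%}")
```

Under these assumptions the computed values agree with the corresponding rows of Fig. 4 up to rounding (the figure prints 66.66 where rounding gives 66.67). Exact fractions are used internally so that equal supports compare equal without floating-point noise.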

Sequences in set M_2 which are not generators inherit generators from their subsequences with the same support. For example, sequence BE contains sequence E, and BE and E have equal support. Hence, we add to G(BE) all sequences in set G(E) (i.e., E).

By iteratively applying the algorithm, we generate set M_3, which includes all sequences with length 3, by joining M_2 with itself. For instance, we generate sequence DCA from sequences DC and CA. DCA has the same support as both CA and DC. Hence, DCA is not a generator sequence. Instead, it inherits generators from both CA and DC. Hence G(DCA) = {D, C}.

Set M_3 does not contribute to the CRC set, since none of its elements is a generator sequence. For set M_2, only sequence AE is a closed sequence. Hence, it generates the compact rule ({E}, AE) → c_1.

Figure 4 reports the CRC and CCRS sets for our example dataset.
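The join and generator-inheritance steps described above can be sketched as follows. This is a simplified illustration, not the chapter's algorithm: it compares a joined sequence only with the two sequences it was joined from, and the absolute support counts and the generator sets of the shorter sequences are assumed values. Only the equalities stated in the text (BE has the same support as E; DCA has the same support as DC and CA) and the result G(DCA) = {D, C} come from the section. The function names join and inherit_generators are hypothetical.

```python
def join(left, right):
    """Join two length-k sequences into one of length k+1.

    The join is defined when left without its first item equals right
    without its last item, e.g. DC + CA -> DCA. Returns None otherwise.
    """
    if left[1:] == right[:-1]:
        return left + right[-1]
    return None


def inherit_generators(seq, parents, support, generators):
    """Return the generator set G(seq).

    Simplification: seq counts as a generator unless one of the two joined
    parents has the same support, in which case seq inherits the generators
    of every equal-support parent.
    """
    equal_support = [p for p in parents if support[p] == support[seq]]
    if not equal_support:
        return {seq}  # no equal-support parent: seq is its own generator
    inherited = set()
    for parent in equal_support:
        inherited |= generators[parent]
    return inherited


# ASSUMPTION: illustrative absolute support counts (number of sequences).
support = {"B": 2, "E": 1, "BE": 1, "DC": 2, "CA": 2, "DCA": 2}
# ASSUMPTION: generator sets of the shorter sequences, taken as given.
generators = {"B": {"B"}, "E": {"E"}, "DC": {"D", "C"}, "CA": {"C"}}

be = join("B", "E")
generators[be] = inherit_generators(be, ("B", "E"), support, generators)

dca = join("DC", "CA")
generators[dca] = inherit_generators(dca, ("DC", "CA"), support, generators)

print(be, sorted(generators[be]))
print(dca, sorted(generators[dca]))
```

With these inputs, BE inherits the generator E and DCA inherits D and C, which matches the two examples in the text. A fuller implementation would compare each candidate against all of its subsequences admitted by the matching constraint, not only the two join parents.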

7 Experimental Results

We ran experiments to evaluate both the compression achieved by the proposed compact representations and the performance of the proposed algorithm. The experiments use three datasets. The Reuters-21578 news and NewsGroups datasets [2] contain textual data. The DNA dataset contains collections of DNA sequences [2]. Table 2 reports the number of items, sequences, and class labels for each dataset. For the Reuters and NewsGroups datasets, items correspond to words in a text. For the DNA dataset, items correspond to the four nucleotide symbols. Table 2 also shows the maximum, minimum, and average length of the sequences in each dataset.
