Irrelevant Feature and Rule Removal for Structural Associative Classification Using Structure-Preserving Flat Representation - Feature Selection for Data and Pattern Recognition - page 217

Table 10.14 Academic Institution flattened data characteristics and initial number of rules for varying support

Support         Attr. #      # Selected attr.   # of Rules with target attr.
threshold (%)   (DSM flat)   (Sym. Tau)         FullTree   Embedded   Induced
1               442          217                -          -          -
5               126          123                28282      28282      28282
10              70           63                 234        234        234
20              36           29                 50         49         49
30              26           19                 14         13         13

to trees as was explained with the illustrative example in Sect. 3.1. The resulting dataset had 18,836 instances, of which 66% were used for training and the remainder for testing. The details of the WebLogs access setting can be found in [16]. The general characteristics of the flat data format (including backtrack attributes) and the initial number of rules extracted for the academic institution data (50% minimum confidence) at varying support thresholds are provided in Table 10.14.

In this dataset, as in the experiments described in Sect. 10.5.2, rules from the FullTree, Embedded and Induced rule sets were progressively assessed with the statistical analysis and redundancy assessment methods. The results demonstrate that converting the original tree-structured data into the flat data format representation created a very large number of input attributes, especially at lower support thresholds. When the Apriori algorithm is used to generate all frequent rules, analyzing every rule that satisfies the given support and confidence constraints can therefore become difficult.
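The support and confidence constraints mentioned above can be illustrated with a minimal Python sketch. It assumes each flattened instance is a dictionary from attribute name to value. The attribute names (`X0`, `X1`, `class`), the toy records and the 25% support threshold are hypothetical and not taken from the dataset; only the 50% minimum confidence comes from the text.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def matches(record: Record, items: Dict[str, str]) -> bool:
    """True if the record carries every attribute=value pair in items."""
    return all(record.get(attr) == val for attr, val in items.items())


def rule_support_confidence(
    records: List[Record],
    antecedent: Dict[str, str],
    consequent: Dict[str, str],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n_antecedent = sum(matches(r, antecedent) for r in records)
    n_both = sum(
        matches(r, antecedent) and matches(r, consequent) for r in records
    )
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical flattened instances (illustration only).
records = [
    {"X0": "home", "X1": "courses", "class": "internal"},
    {"X0": "home", "X1": "courses", "class": "internal"},
    {"X0": "home", "X1": "staff", "class": "external"},
    {"X0": "home", "X1": "courses", "class": "external"},
]

support, confidence = rule_support_confidence(
    records, {"X1": "courses"}, {"class": "internal"}
)

# Assumed thresholds: 25% support (hypothetical), 50% confidence (from the text).
keep_rule = support >= 0.25 and confidence >= 0.50
```

Because every attribute=value combination that clears both thresholds yields a candidate rule, the number of rules grows quickly as the support threshold is lowered and more attributes remain frequent.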

As Table 10.15 shows, even with the given support constraint, the number of extracted rules (Initial Rule Set) is large. A large volume of rules may be discovered due to the presence of irrelevant attributes in the dataset. The capability of ST to select appropriate attributes, and thereby remove irrelevant ones, was shown in our previous experiments on relational data problems. Similar experiments were conducted for this particular task of evaluating tree-structured rules. For each support threshold, the attributes were ranked in decreasing order of their ST values and a relevance cut-off point was chosen.
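The ranking step can be sketched as follows. The sketch assumes the commonly cited contingency-table definition of Symmetrical Tau, which this page does not restate. The toy records, the attribute names (`X1`, `X2`, `b0`) and the cut-off value are hypothetical, and the text does not say how the actual cut-off point was chosen.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def symmetrical_tau(xs: Sequence[str], ys: Sequence[str]) -> float:
    """Symmetrical Tau between an input attribute (xs) and the class (ys).

    Assumed definition, with P(ij) the joint and P(i+), P(+j) the marginals:
      tau = [sum_ij P(ij)^2/P(+j) + sum_ij P(ij)^2/P(i+)
             - sum_i P(i+)^2 - sum_j P(+j)^2]
            / [2 - sum_i P(i+)^2 - sum_j P(+j)^2]
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    p_row = {i: c / n for i, c in Counter(xs).items()}
    p_col = {j: c / n for j, c in Counter(ys).items()}
    sum_row_sq = sum(p * p for p in p_row.values())
    sum_col_sq = sum(p * p for p in p_col.values())
    term_col = sum((c / n) ** 2 / p_col[j] for (_, j), c in joint.items())
    term_row = sum((c / n) ** 2 / p_row[i] for (i, _), c in joint.items())
    denom = 2.0 - sum_row_sq - sum_col_sq
    if denom == 0.0:  # both variables are constant; treat as no association
        return 0.0
    return (term_col + term_row - sum_row_sq - sum_col_sq) / denom


def rank_attributes(
    records: List[Dict[str, str]], class_attr: str
) -> List[Tuple[str, float]]:
    """Rank input attributes by decreasing Symmetrical Tau against the class."""
    ys = [r[class_attr] for r in records]
    attrs = [a for a in records[0] if a != class_attr]
    scores = [(a, symmetrical_tau([r[a] for r in records], ys)) for a in attrs]
    return sorted(scores, key=lambda pair: -pair[1])


# Hypothetical flattened instances; b0 stands in for a single-valued attribute.
records = [
    {"X1": "a", "X2": "u", "b0": "1", "class": "p"},
    {"X1": "a", "X2": "v", "b0": "1", "class": "p"},
    {"X1": "b", "X2": "u", "b0": "1", "class": "q"},
    {"X1": "b", "X2": "v", "b0": "1", "class": "q"},
]

CUTOFF = 0.1  # hypothetical relevance cut-off
ranking = rank_attributes(records, "class")
selected = [attr for attr, tau in ranking if tau > CUTOFF]
```

Under this definition a single-valued attribute such as `b0` scores zero, since its only value carries no information about the class, so it falls below any positive cut-off.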

Table 10.15 indicates the differences between the number of initial input attributes and the number of attributes after applying Symmetrical Tau (ST), with their respective rule numbers (below), for each dataset at each support threshold. All attributes that have been removed from the WebLogs data are backtrack attributes. This indicates that these backtrack nodes may not be useful, or have low capability, for predicting the class attribute in this dataset. An input variable that contains a single value is unable to distinguish between the class values. Such input attributes have been discarded as they are considered irrelevant based on the calculated ST value. With the application of the ST feature selection technique, rules that contain attributes that failed the ST measure are discarded. The large number of rules was successfully reduced
