The flat data format (relational or vectorial data) has proven acceptable and successful with many well-established data mining techniques. Hence, an effective approach proposed in [14], known as the Database Structure Model (DSM), is utilized in this research to represent tree-structured data in a structure-preserving flat data format. This approach preserves both the tree structure and the attribute-value information. With the application of DSM, the structural characteristics are preserved during the data mining process, and the rules extracted by the data mining application can be mapped back onto the DSM to regenerate the pre-order string encodings of subtrees.
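To make the pre-order string encoding concrete, the following is a minimal sketch of a depth-first serialization of a labelled tree. The "-1" backtrack symbol and the (label, children) tree representation are illustrative assumptions following the common string encoding used in tree mining, not necessarily the exact conventions of DSM in [14].

```python
# Minimal sketch: pre-order string encoding of a labelled tree.
# The "-1" backtrack symbol and the (label, children) representation
# are illustrative assumptions, not the exact DSM conventions of [14].

def preorder_encoding(tree):
    """Encode a tree given as (label, [children]) into a pre-order string.

    Each node label is emitted when the node is first visited, and the
    backtrack symbol "-1" is emitted when returning to the parent.
    """
    label, children = tree
    parts = [str(label)]
    for child in children:
        parts.extend(preorder_encoding(child))
        parts.append("-1")
    return parts

# Example tree:   a
#                / \
#               b   c
#               |
#               d
tree = ("a", [("b", [("d", [])]), ("c", [])])
print(" ".join(preorder_encoding(tree)))  # a b d -1 -1 c -1
```

Because the encoding records both labels and backtracks, the original subtree shape can be regenerated from the string, which is the property DSM relies on when mapping extracted rules back to subtrees.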
Let a tree-structured dataset in flat table format (FDT) be denoted as D, I = {i_1, i_2, ..., i_|I|} the set of distinct items in D, AT = {at_1, at_2, ..., at_|AT|} the set of input attributes in D, and Y = {y_1, y_2, ..., y_|Y|} the class attribute with a set of class labels in D. Assume that D contains a set of n records, D = {(x_r, y_r)}, r = 1, ..., n, where x_r ⊆ I is an item or a set of items and y_r ∈ Y is a class label; then |x_r| = |AT| and x_r = {at_1 val_r, at_2 val_r, ..., at_|AT| val_r} contains the attribute names and corresponding values for record r in D, for each attribute at in AT. The training dataset is denoted as D_tr ⊆ D, the testing dataset as D_ts ⊆ D, and the filtered database after feature selection as D'_tr, where I' ⊆ I.
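The notation above can be illustrated with a tiny concrete dataset; the attribute names, values, and class labels below are hypothetical, chosen only to show the structure of D, I, AT, and Y.

```python
# Tiny concrete instance of the notation above.  Attribute names,
# values, and class labels are hypothetical.  Each record r is a pair
# (x_r, y_r): x_r holds one value per attribute in AT, y_r is a label.
AT = ["at1", "at2", "at3"]   # input attributes
Y = {"yes", "no"}            # class labels

D = [
    ({"at1": "a", "at2": "b", "at3": "c"}, "yes"),
    ({"at1": "a", "at2": "d", "at3": "c"}, "no"),
    ({"at1": "e", "at2": "b", "at3": "f"}, "yes"),
]

# |x_r| = |AT| holds for every record r.
assert all(len(x_r) == len(AT) for x_r, _ in D)

# The set of distinct items I: every (attribute, value) pair seen in D.
I = {(at, x_r[at]) for x_r, _ in D for at in AT}

# A simple split into training and testing subsets, D_tr ⊆ D, D_ts ⊆ D.
D_tr, D_ts = D[:2], D[2:]
```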
The rule sets extracted from the flat table format (FDT) satisfying the minimum support and confidence thresholds are denoted as F(A). Individual rules are denoted as f_A ∈ F(A), of the form x → y, where x is the antecedent and y the consequent, such that ∃{x_r, y_r} ∈ D_tr with x ⊆ x_r, x_r = {at_1 val_r, at_2 val_r, ..., at_|AT| val_r}, and y ∈ Y is a class label. For generating F(A), SAS Enterprise Miner software was used.
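The chapter generated F(A) with SAS Enterprise Miner; purely to illustrate how the support and confidence thresholds filter class association rules of the form x → y over D_tr, here is a minimal sketch with hypothetical data (the function name and the brute-force candidate enumeration are assumptions, not the tool's algorithm).

```python
from itertools import combinations

# Minimal sketch of generating F(A): class association rules x -> y
# over a training set D_tr, filtered by minimum support and confidence.
# The chapter used SAS Enterprise Miner; this brute-force enumeration
# only illustrates the thresholding, with hypothetical data.

def class_rules(D_tr, min_sup, min_conf, max_len=2):
    n = len(D_tr)
    rules = []
    for size in range(1, max_len + 1):
        # Candidate antecedents: item sets drawn from observed records.
        candidates = {frozenset(c)
                      for x_r, _ in D_tr
                      for c in combinations(sorted(x_r.items()), size)}
        for x in candidates:
            # Records whose attribute-value pairs contain the antecedent.
            cover = [y_r for x_r, y_r in D_tr if x <= set(x_r.items())]
            sup = len(cover) / n
            if sup < min_sup:
                continue
            for y in set(cover):
                conf = cover.count(y) / len(cover)
                if conf >= min_conf:
                    rules.append((dict(x), y, sup, conf))
    return rules

D_tr = [
    ({"at1": "a", "at2": "b"}, "yes"),
    ({"at1": "a", "at2": "c"}, "yes"),
    ({"at1": "d", "at2": "b"}, "no"),
]
for x, y, sup, conf in class_rules(D_tr, min_sup=0.5, min_conf=0.9):
    print(x, "->", y, f"sup={sup:.2f} conf={conf:.2f}")
```

On this toy data only {at1: a} → yes survives both thresholds: it covers two of three records (support 0.67) and every covered record carries the label "yes" (confidence 1.0).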
Feature Subset Selection: The Symmetrical Tau (ST) measure [54] was derived from Goodman and Kruskal's Asymmetrical Tau measure of association for cross-classification tasks in the statistical domain. Zhou and Dillon [54] used the Asymmetrical Tau measure for feature selection during decision tree building, and found that it tends to favour attributes with more values. When the classes of an attribute A are increased by class subdivision, more is known about attribute A, and the probability of error in predicting the class of another attribute B may decrease. On the other hand, attribute A becomes more complex, potentially causing an increase in the probability of error in predicting its category from the category of B. This trade-off effect inspired Zhou and Dillon [54] to combine the two asymmetrical measures in order to obtain a balanced feature selection criterion which is, in turn, symmetrical. Note, however, that for Boolean variables the symmetrical and asymmetrical tau have the same value. Some powerful properties of ST, as reported in [54], are: noise handling through built-in statistical strength; conveying potential classification uncertainties through dynamic error estimation; no bias towards multi-valued attributes; no proportionality to sample size; a proportional-reduction-in-error nature that allows measuring sequential variation in predictive capability; and handling of Boolean combinations of logical features.
Let there be R rows and C columns in the contingency table for attributes at_i and Y. The probability that an individual belongs to row category r and column category c is represented as P(rc), and P(r+) and P(+c) are the marginal probabilities in row category r and column category c, respectively. The measure is then defined in terms of these joint and marginal probabilities.