Fig. 10.1 Example of induced/embedded subtrees (T1, T2, T4, T6) and embedded subtrees (T3, T5) of tree T
In this chapter, the focus is on evaluating rules based on embedded and induced subtrees that satisfy minimum support and confidence thresholds, and discarding any rules determined to be irrelevant to the classification task at hand. Let us denote the subtree patterns from the frequent subtree set SF that have a class label (value) as SFC, their accuracy as ac(SFC) and their coverage rate as cr(SFC). The problem focused on in this work can then be generally defined as follows: given SFC with accuracy ac(SFC) and coverage rate cr(SFC), obtain SFC' ⊆ SFC such that ac(SFC') ≥ (ac(SFC) − ε) and cr(SFC') ≥ (cr(SFC) − ε) (ε is an arbitrary user-defined small value used to reflect the noise that is often present in real-world data).
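For clarity, the selection criterion can be restated as a single condition (this compact restatement is ours; SFC' denotes the reduced rule set retained after filtering):

\[
  SFC' \subseteq SFC, \qquad
  ac(SFC') \ge ac(SFC) - \varepsilon, \qquad
  cr(SFC') \ge cr(SFC) - \varepsilon
\]

In other words, rules may be discarded as long as the accuracy and coverage of the remaining rule set do not drop by more than the tolerance ε.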
In what follows we discuss the common way of representing trees. This will lay the necessary ground for understanding the positional constraint imposed by the DSM approach [14]. A pre-order traversal can be computed as follows: if an ordered tree T consists only of a root node r, then r is the pre-order traversal of T. Otherwise, let T1, T2, ..., Tn be the subtrees occurring at r from left to right in T. The pre-order traversal begins by visiting r and then traversing all the remaining subtrees in pre-order, starting from T1 and finishing with Tn. The string encoding (φ) of a tree T = (v0, V, L, E) can be generated by adding vertex labels in the pre-order traversal of T and appending a backtrack symbol (e.g., '/', with '/' ∉ L) whenever we backtrack from a child node to its parent node. Figure 10.2 and Table 10.1 depict a tree database consisting of 7 tree instances (or transactions) and the string encoding for the tree database, respectively.
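To make the encoding concrete, the following minimal Python sketch produces such a string encoding for a small labelled ordered tree. The Node class, the function name string_encoding, and the example tree are illustrative assumptions, not the implementation used by the DSM approach.

    class Node:
        """A vertex of a labelled ordered tree."""
        def __init__(self, label, children=None):
            self.label = label
            self.children = children or []

    def string_encoding(root):
        # Emit the vertex label on first visit, recurse into the children
        # in left-to-right order, and emit the backtrack symbol '/' each
        # time traversal returns from a child to its parent.
        parts = [root.label]
        for child in root.children:
            parts.append(string_encoding(child))
            parts.append('/')
        return ' '.join(parts)

    # Example: the tree b(b, c(e)) is encoded as "b b / c e / /"
    t = Node('b', [Node('b'), Node('c', [Node('e')])])
    print(string_encoding(t))

Since every label opens a new node under the current one and every '/' closes the current node, the original ordered tree can be reconstructed unambiguously from the string; some formulations omit the trailing backtracks that return to the root, which does not affect this property.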
10.3.1 Feature Subset Selection
Feature subset selection is an important pre-processing step in the data mining process. If irrelevant attributes are left in the dataset, they can interfere with the data mining process and the quality of the discovered patterns may deteriorate, creating problems such as overfitting [9]. This is particularly the case for associative classifiers, since frequent patterns are typically generated without considering their predictive power [9], resulting in a huge feature space of possible frequent patterns. The removal of irrelevant attributes will result in a much smaller dataset, thereby
 