Irrelevant Feature and Rule Removal for Structural Associative Classification Using Structure-Preserving Flat Representation - Feature Selection for Data and Pattern Recognition - page 208


Fig. 10.6 Method and experimental setup

[Flowchart not reproduced. Box labels recovered from the figure: XML documents / tree-structured data; extraction of the Database Structure Model (DSM); generation of the flat data format (FDT); tree-structured data in flat format; pre-processing (missing data, data transformation, data partitioning); training dataset (D_tr) and testing dataset (D_ts); feature subset selection using Symmetrical Tau (ST); filtered dataset; generation of association rules with class variable; association rules w.r.t. FDT; statistical analysis (chi-squared test; Logistic Regression (LR): determination of the LR model, testing for significance of the coefficients); (a) removal of rules that contain attributes that are not statistically significant; (b) removal of redundant rules; (c) removal of contradictive rules; (d) filtering rules based on confidence threshold; a set of significant rules; mapping association rules to the DSM; re-generating the pre-order string encoding of subtrees; association rules based on (i) FullTree, (ii) embedded subtrees, (iii) induced subtrees; rule verification (rule accuracy, rule coverage). Legend: process, rules, data.]

to be binned; (2) using the specified number of bins, calculate the boundary (width) of each bin; (3) using the specified boundaries, assign each value of the variable to a bin for each record. Data partitioning, missing value imputation and discretization were performed using the SAS Enterprise Miner software (see [35] for further detail on the use of software for data pre-processing). Secondly, feature subset selection is performed as described in [15], ranking the attributes by their predictive capability according to the Symmetrical Tau measure [54].
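The binning steps (2) and (3) and the attribute ranking can be illustrated with a small sketch. It is not the procedure of [15] or the SAS Enterprise Miner implementation used in the experiments. It assumes equal-width bins spanning the observed minimum and maximum, and it uses the usual contingency-table form of Symmetrical Tau, written here as an assumption that should be checked against [54]. The function names `equal_width_bins`, `symmetrical_tau` and `rank_attributes` are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Assign each value to one of n_bins equal-width bins (indices 0..n_bins-1)."""
    lo, hi = min(values), max(values)   # assumption: bins span the observed range
    width = (hi - lo) / n_bins          # step (2): bin width from the number of bins
    if width == 0:                      # constant variable: everything goes to bin 0
        return [0] * len(values)
    # step (3): map each value to its bin; the maximum is clamped into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def symmetrical_tau(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Symmetrical Tau between two categorical variables (assumed formula).

    tau = (sum_ij P(ij)^2/P(i+) + sum_ij P(ij)^2/P(+j) - sum_i P(i+)^2 - sum_j P(+j)^2)
          / (2 - sum_i P(i+)^2 - sum_j P(+j)^2)
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))        # cell counts of the contingency table
    row = Counter(xs)                   # marginal counts of the attribute
    col = Counter(ys)                   # marginal counts of the class
    sum_row_sq = sum((c / n) ** 2 for c in row.values())
    sum_col_sq = sum((c / n) ** 2 for c in col.values())
    denom = 2.0 - sum_row_sq - sum_col_sq
    if denom == 0:                      # both variables are constant
        return 0.0
    given_row = sum((c / n) ** 2 / (row[x] / n) for (x, _), c in joint.items())
    given_col = sum((c / n) ** 2 / (col[y] / n) for (_, y), c in joint.items())
    return (given_row + given_col - sum_row_sq - sum_col_sq) / denom


def rank_attributes(columns: Dict[str, Sequence[Hashable]],
                    target: Sequence[Hashable]) -> List[str]:
    """Return attribute names sorted by decreasing Symmetrical Tau with the class."""
    return sorted(columns, key=lambda name: symmetrical_tau(columns[name], target),
                  reverse=True)


# Illustrative data only: discretize a numeric attribute, then rank two attributes.
age_bins = equal_width_bins([21, 25, 38, 44, 59, 63], n_bins=3)
ranking = rank_attributes(
    {"age_bin": age_bins, "noise": [0, 1, 0, 1, 0, 1]},
    target=["n", "n", "n", "y", "y", "y"],
)
```

A feature subset would then be taken from the top of `ranking`. Where to cut the ranking is a separate decision that this sketch does not address.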

An association rule mining algorithm is then used to discover frequent rules from the FDT, and the rules are filtered in sequence: a chi-square test, Logistic Regression model selection, redundant rule removal (based on the minimum improvement redundant rule constraint [4]) and, optionally, filtering based on a higher confidence threshold. The extracted association rules are mapped onto the DSM (by the pre-order position of each item) to re-generate the pre-order string encoding of subtrees, thereby representing them as subtrees of the tree database.
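The last two filtering steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a rule is redundant when its confidence does not exceed that of every more general rule for the same class by at least a minimum improvement, and it compares only against sub-rules that are present in the mined rule set (the constraint in [4] may be defined over all sub-rules). The `Rule` type, the function names and the item strings are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]   # items such as "x1=b" (pre-order position = label)
    label: str                   # class value in the consequent
    confidence: float


def remove_redundant(rules: List[Rule], min_improvement: float) -> List[Rule]:
    """Keep a rule only if it improves on every more general rule for its class."""
    kept = []
    for rule in rules:
        # confidences of proper sub-rules (strict subset antecedent, same class)
        sub_confs = [r.confidence for r in rules
                     if r.label == rule.label and r.antecedent < rule.antecedent]
        if not sub_confs or rule.confidence - max(sub_confs) >= min_improvement:
            kept.append(rule)
    return kept


def filter_by_confidence(rules: List[Rule], min_confidence: float) -> List[Rule]:
    """Optional final step: keep rules at or above a higher confidence threshold."""
    return [r for r in rules if r.confidence >= min_confidence]


# Illustrative rules only.
mined = [
    Rule(frozenset({"x1=a"}), "yes", 0.78),
    Rule(frozenset({"x1=a", "x2=b"}), "yes", 0.80),   # gains only 0.02 over {x1=a}
    Rule(frozenset({"x1=a", "x3=c"}), "yes", 0.95),   # gains 0.17 over {x1=a}
    Rule(frozenset({"x2=b"}), "no", 0.60),
]
non_redundant = remove_redundant(mined, min_improvement=0.05)
significant = filter_by_confidence(non_redundant, min_confidence=0.70)
```

With these inputs, the rule on `{x1=a, x2=b}` is dropped as redundant and the rule for class `no` is dropped by the confidence threshold. The chi-square and Logistic Regression steps that precede them in the sequence are not covered by the sketch.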

These rules may contain both valid and invalid (disconnected) subtrees, and we refer to this rule set as FullTree. In addition, the rules based on embedded subtrees and the rules based on induced subtrees (the rule sets that exclude disconnected subtrees) are identified within the extracted rules. Finally, the rule accuracy and coverage rate are calculated for all rule sets at the different stages.
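Mapping a rule back onto the DSM and deciding whether it forms a valid subtree can be sketched as below. The data model is an assumption made for illustration: the DSM is given as a parent array indexed by pre-order position, a rule is a mapping from pre-order position to node label, "-1" marks a backtrack in the string encoding, and trailing backtracks are omitted. A rule is treated as disconnected when its nodes do not share a single root among themselves. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def nearest_selected_ancestor(pos: int, parent: List[int], selected: Set[int]) -> int:
    """Walk up the DSM from pos and return the first selected ancestor, or -1."""
    p = parent[pos]
    while p != -1 and p not in selected:
        p = parent[p]
    return p


def classify_and_encode(parent: List[int],
                        rule_items: Dict[int, str]) -> Tuple[str, Optional[str]]:
    """Classify a rule as 'induced', 'embedded' or 'disconnected' and encode it.

    parent[i] is the pre-order position of node i's parent in the DSM (-1 for the
    root). Pre-order numbering means that sorted positions are in pre-order.
    """
    selected = set(rule_items)
    order = sorted(selected)
    proj = {pos: nearest_selected_ancestor(pos, parent, selected) for pos in order}
    roots = [pos for pos in order if proj[pos] == -1]
    if len(roots) != 1:
        return "disconnected", None          # invalid subtree: no single root
    # induced: every non-root node keeps its direct DSM parent in the rule
    induced = all(proj[pos] == parent[pos] for pos in order if pos != roots[0])
    tokens: List[str] = []
    stack: List[int] = []
    for pos in order:
        # backtrack until the node on top of the stack is this node's parent
        while stack and stack[-1] != proj[pos]:
            stack.pop()
            tokens.append("-1")
        tokens.append(rule_items[pos])
        stack.append(pos)
    return ("induced" if induced else "embedded"), " ".join(tokens)


# Illustrative DSM: node 0 is the root, 1 and 4 are its children, 2 and 3 are
# children of node 1.
dsm_parent = [-1, 0, 1, 1, 0]
induced_rule = classify_and_encode(dsm_parent, {0: "a", 1: "b", 4: "c"})
embedded_rule = classify_and_encode(dsm_parent, {0: "a", 2: "d"})
invalid_rule = classify_and_encode(dsm_parent, {2: "d", 4: "c"})
```

Every induced subtree is also an embedded subtree; the sketch reports the more specific label. Under these assumptions, the FullTree rule set corresponds to all three outcomes, while the embedded and induced rule sets exclude the disconnected case.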

Tree-Structured Data Format Conversion: For given tree-structured data, the major problem one needs to tackle is the enumeration of all possible subtrees in a complete, non-redundant and efficient way [43]. At lower support thresholds, a significant delay may occur in the analysis and interpretation of subtree patterns. Additionally, because a large number of frequent subtree patterns may be discovered, many of which may not be useful, one needs to filter out the irrelevant or uninteresting patterns.
