Fig. 10.6 Method and experimental setup
[Flowchart components: XML documents / tree-structured data; extraction of the Database Structure Model (DSM) and generation of the flat data format (FDT); pre-processing (missing data, data transformation, data partitioning into training dataset D_tr and testing dataset D_ts); feature subset selection with Symmetrical Tau (ST), giving the filtered dataset; generation of association rules w.r.t. the FDT; statistical analysis (chi-squared test; logistic regression (LR): determination of the LR model, testing for significance of the coefficients); (a) removal of rules that contain attributes that are not statistically significant; (b) removal of redundant rules; (c) removal of contradictive rules; (d) filtering of rules based on a confidence threshold; rule verification (rules accuracy, rules coverage), giving a set of significant rules; mapping of association rules to the DSM to re-generate the pre-order string encoding of subtrees, giving association rules with class variable based on (i) FullTree, (ii) embedded subtrees, (iii) induced subtrees. Legend: process, rules, data.]
to be binned; (2) using the specified number of bins, calculate the boundary (width)
of each bin; (3) using the specified boundaries, assign each value of the variable to a bin
for each record. The data partitioning, missing value imputation and discretization
were performed using the SAS Enterprise Miner software (please refer to [35] for
further detail on the use of the software for data pre-processing). Secondly, feature subset
selection based on attribute ranking according to the Symmetrical Tau measure [54] of
predictive capability is performed as described in [15].
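The binning steps listed above can be illustrated with a short Python sketch. It assumes equal-width bins, which is one reading of "boundary (width)"; it is not the SAS Enterprise Miner procedure used in the experiments, and the function names `bin_boundaries` and `assign_bin` as well as the sample values are hypothetical.

```python
def bin_boundaries(values, n_bins):
    """Step 2: derive equal-width bin edges from the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # n_bins + 1 edges; the last edge is set to hi to avoid rounding drift
    return [lo + i * width for i in range(n_bins)] + [hi]


def assign_bin(value, edges):
    """Step 3: return the 0-based index of the bin that contains value."""
    n_bins = len(edges) - 1
    for i in range(n_bins):
        if value < edges[i + 1]:
            return i
    return n_bins - 1  # the maximum value falls into the last bin


# Illustrative data only (step 1: the variable chosen for binning).
ages = [18, 22, 25, 31, 40, 47, 58, 60]
edges = bin_boundaries(ages, 3)
binned = [assign_bin(a, edges) for a in ages]
```

Each record then carries the bin index in place of the raw value, which is the form an association rule miner over categorical items would expect.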
An association rule mining algorithm is then used to discover frequent rules from
the FDT. The rules are filtered in sequence by the chi-square test, Logistic
Regression model selection, redundant rule removal (based on the minimum improvement
redundant rule constraint [4]) and, optionally, a higher confidence
threshold. The extracted association rules are mapped onto the DSM
(by the pre-order position of each item) to re-generate the pre-order string encoding
of subtrees, thereby representing them as subtrees of the tree database.
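The mapping of a rule back onto the DSM might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the DSM is represented here as a list of parent positions indexed by pre-order position, a rule is a dictionary from pre-order position to item label, and `-1` is used as the backtrack symbol. The names `preorder_encoding`, `dsm_parent` and `rule_items`, and the example labels, are hypothetical.

```python
def preorder_encoding(parent, items):
    """Re-generate pre-order string encodings for the nodes named in a rule.

    parent: parent[i] is the pre-order position of node i's parent (-1 for root).
    items:  {pre-order position: label} taken from the rule.
    Returns one encoding per connected piece; a valid subtree yields exactly one.
    """
    selected = sorted(items)  # ascending pre-order position = traversal order
    chosen = set(selected)

    def nearest_selected_ancestor(pos):
        p = parent[pos]
        while p != -1 and p not in chosen:
            p = parent[p]
        return p

    children = {pos: [] for pos in selected}
    roots = []
    for pos in selected:
        anc = nearest_selected_ancestor(pos)
        if anc == -1:
            roots.append(pos)
        else:
            children[anc].append(pos)

    def encode(pos):
        tokens = [items[pos]]
        for child in children[pos]:
            tokens += encode(child) + ["-1"]  # backtrack after each child
        return tokens

    return [" ".join(encode(r)) for r in roots]


# Assumed DSM: node 0 is the root, 1 and 3 are its children, 2 is a child of 1.
dsm_parent = [-1, 0, 1, 0]
rule_items = {0: "order", 2: "item=book", 3: "status=paid"}
encodings = preorder_encoding(dsm_parent, rule_items)
```

In this example node 1 is absent from the rule, so node 2 is attached to its nearest selected ancestor (node 0). A rule whose items share no selected ancestor produces more than one encoding, which corresponds to a disconnected subtree.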
These rules may contain both valid and invalid subtrees (disconnected subtrees),
and we will refer to this rule set as FullTree. In addition, the rules based on embedded
subtrees and the rules based on induced subtrees (the rule sets that exclude disconnected
subtrees) are also identified within the extracted rules. Finally, the rule accuracy
and coverage rate are calculated for all rule sets at different stages.
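The distinction between the three rule sets can be made concrete with a small Python sketch. It assumes the same hypothetical DSM representation as a list of parent positions indexed by pre-order position, and uses the standard definitions: an induced subtree keeps direct parent-child links, an embedded subtree only needs ancestor-descendant links, and a disconnected node set has more than one top-most node. The function name `classify_rule` is an assumption.

```python
def classify_rule(parent, positions):
    """Return the most specific class for the node positions used by a rule."""
    chosen = set(positions)
    if not chosen:
        raise ValueError("a rule must reference at least one node")

    def has_selected_ancestor(pos):
        p = parent[pos]
        while p != -1:
            if p in chosen:
                return True
            p = parent[p]
        return False

    # Top-most selected nodes: those with no selected ancestor.
    roots = [p for p in chosen if not has_selected_ancestor(p)]
    if len(roots) > 1:
        return "disconnected"  # kept only in the FullTree rule set
    if all(parent[p] in chosen for p in chosen if p != roots[0]):
        return "induced"       # every non-root node keeps its direct parent
    return "embedded"          # connected only through ancestor links


# Assumed DSM: node 0 is the root, 1 and 3 are its children, 2 is a child of 1.
dsm_parent = [-1, 0, 1, 0]
labels = [classify_rule(dsm_parent, s) for s in ([0, 1, 2], [0, 2, 3], [2, 3])]
```

Under these definitions every induced subtree is also an embedded subtree, so the embedded rule set would contain the rules labelled either "induced" or "embedded", while FullTree contains all three.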
Tree-Structured Data Format Conversion: For given tree-structured data, the major
problem one needs to tackle is the enumeration of all possible subtrees in a complete,
non-redundant and efficient way [43]. At lower support thresholds, a significant delay
may occur in the analysis and interpretation of subtree patterns. Additionally,
as a large number of frequent subtree patterns may be discovered, many of
which may not be useful, one needs to filter out many of the irrelevant/uninteresting
patterns.
 