Irrelevant Feature and Rule Removal for Structural Associative Classification Using Structure-Preserving Flat Representation - Feature Selection for Data and Pattern Recognition

Information Technology Reference

In-Depth Information

evaluating rule interestingness as discussed in [ 44 ]. Several works on the evaluation

of discovered patterns based on statistical significance [ 2 , 22 , 46 ] are limited to

relational data. The existence of vast well-developed measuring techniques to evalu-

ate interestingness of rules from relational data, offers great opportunities for adapt-

ing these techniques for verifying significant subtrees from semi-structured data. The

applicability of these interestingness measures needs to be explored in the context

of frequent subtree mining, where necessary adjustments and extensions need to be

made to ascertain the validity of the methods given the more complex structured

aspects in the data, which often need to be preserved in the rules.

One line of work focusing on more interesting subtree patterns aims to reduce

the patterns through the application of plausible constraints. The problem of min-

ing mutually dependent ordered subtrees has been addressed in [ 32 ]. The proposed

algorithm utilizes the hyperclique method [ 47 ] in the tree mining context so that

all the components of a subtree are highly correlated together. These hyperclique

subtree patterns are discovered using an h-confidence measure which is the mini-

mum probability of an item from a pattern in one transaction implying the presence

of all other items in the same transaction. Hence, the extracted hyperclique subtree

patterns will satisfy the minimum h-confidence threshold. The work done in [ 3 ]uses

the method proposed for database compression in regards to item set mining in [ 39 ]

to demonstrate how the same minimum description length principle can yield good

results for sequential and tree-structured data. The work presented in [ 31 ] extends

the idea of the item constraint [ 41 ] to that of a node-inclusion constraint in subtrees.

Furthermore, Knijf and Feelders [ 20 ] proposed the use of monotone constraints in

frequent subtree mining, namely monotone, anti-monotone, convertible and succinct

constraints. Using these constraints, the frequent subtrees are mined using an oppor-

tunistic pruning strategy, and the set of frequent subtrees are reduced to only those

satisfying the specific user pre-defined constraints.

Besides the aforementioned constraint-based techniques, to the best of our

knowledge, there are limited works on verifying the significance of discovered fre-

quent subtrees. Hashimoto et al. [ 19 ] proposed and developed an application of sta-

tistical hypothesis testing to re-rank the significant frequent subtrees. This approach

ranks the significant patterns according to P-values obtained from the Fisher's Exact

test of significance. The significant patterns were then used for Glycan classifi-

cations problems. Recently Yan et al. [ 48 ], proposed a mining framework called

LEAP (Descending Leap Mine) for checking and mining significant frequent sub-

graphs which helps to discard redundant frequent subgraphs. For a predefined class

label in XML documents, an efficient XRules classifier has been proposed in [ 52 ].

This approach offers promising results in terms of a structural classifier for semi-

structured data, but utilizes standard measures of interestingness based on support

and confidence.

Search WWH ::

Custom Search

Home