Information Technology Reference
In-Depth Information
evaluating rule interestingness as discussed in [ 44 ]. Several works on the evaluation
of discovered patterns based on statistical significance [ 2 , 22 , 46 ] are limited to
relational data. The existence of vast well-developed measuring techniques to evalu-
ate interestingness of rules from relational data, offers great opportunities for adapt-
ing these techniques for verifying significant subtrees from semi-structured data. The
applicability of these interestingness measures needs to be explored in the context
of frequent subtree mining, where necessary adjustments and extensions need to be
made to ascertain the validity of the methods given the more complex structured
aspects in the data, which often need to be preserved in the rules.
One line of work focusing on more interesting subtree patterns aims to reduce
the patterns through the application of plausible constraints. The problem of min-
ing mutually dependent ordered subtrees has been addressed in [ 32 ]. The proposed
algorithm utilizes the hyperclique method [ 47 ] in the tree mining context so that
all the components of a subtree are highly correlated together. These hyperclique
subtree patterns are discovered using an h-confidence measure which is the mini-
mum probability of an item from a pattern in one transaction implying the presence
of all other items in the same transaction. Hence, the extracted hyperclique subtree
patterns will satisfy the minimum h-confidence threshold. The work done in [ 3 ]uses
the method proposed for database compression in regards to item set mining in [ 39 ]
to demonstrate how the same minimum description length principle can yield good
results for sequential and tree-structured data. The work presented in [ 31 ] extends
the idea of the item constraint [ 41 ] to that of a node-inclusion constraint in subtrees.
Furthermore, Knijf and Feelders [ 20 ] proposed the use of monotone constraints in
frequent subtree mining, namely monotone, anti-monotone, convertible and succinct
constraints. Using these constraints, the frequent subtrees are mined using an oppor-
tunistic pruning strategy, and the set of frequent subtrees are reduced to only those
satisfying the specific user pre-defined constraints.
Besides the aforementioned constraint-based techniques, to the best of our
knowledge, there are limited works on verifying the significance of discovered fre-
quent subtrees. Hashimoto et al. [ 19 ] proposed and developed an application of sta-
tistical hypothesis testing to re-rank the significant frequent subtrees. This approach
ranks the significant patterns according to P-values obtained from the Fisher's Exact
test of significance. The significant patterns were then used for Glycan classifi-
cations problems. Recently Yan et al. [ 48 ], proposed a mining framework called
LEAP (Descending Leap Mine) for checking and mining significant frequent sub-
graphs which helps to discard redundant frequent subgraphs. For a predefined class
label in XML documents, an efficient XRules classifier has been proposed in [ 52 ].
This approach offers promising results in terms of a structural classifier for semi-
structured data, but utilizes standard measures of interestingness based on support
and confidence.
Search WWH ::




Custom Search