Irrelevant Feature and Rule Removal for Structural Associative Classification Using Structure-Preserving Flat Representation - Feature Selection for Data and Pattern Recognition

Information Technology Reference

In-Depth Information

10.3 Problem Background

The problem of finding association rules x

y was first introduced in [ 1 ]asa

data mining task for finding frequently co-occurring items in large databases. Let

ₒ

be a set of items. Let D be a transactions database for which

each record/transaction R is a set of items, such that R

i 1 ,

i 2 ,...,

| }

ↆ

I . An association rule is

an implication of the form x

ₒ

y where x

ↆ

I and y

ↆ

I and x

∩

=∅

.The

absolute support of a rule x

y is the number of transactions that contain both x

and y . Typically, the relative support is used, where given the support of rule x

ₒ

(denoted as

y )) be s %, there are s % of transactions in D that contain items

(itemsets) x and y . In other words, the probability P

( x

ₒ

s %. An itemset is

frequent if it satisfies the user-specified minimum support threshold. The confidence

of a rule x

(

∪

) =

y , is the estimate of conditional probability of a transaction containing

the consequent ( y ) if the transaction contains the antecedent ( x ), and is calculated as

˃(

ₒ

Association rule discovery finds all rules that satisfy specific constraints such

as the minimum support and confidence threshold, as is the case in the Apriori

algorithm [ 1 ]. When tree-structured data such as XML is in question, the under-

lying associations are tree-structured by nature. Thus, the pre-requisite for the dis-

covery of (structural) association rules becomes the task of frequent subtree min-

ing. A tree-structured document can be modeled as a rooted ordered labeled tree.

A rooted ordered labeled tree can be denoted as T

ₒ

)/˃ (

)

is the root vertex; (2) V is the set of vertices or nodes; (3) L is a labelling function

that assigns a label L

= (

v 0 ,

)

, where (1) V 0 ∈

(

)

∈

={ (

v 1 ,

v 2 ) |

v 1 ,

v 2 ∈

to every vertex v

V ;(4) E

V AND

v 1 =

is the set of edges in the tree, and (5) for each internal nodes, the children

are ordered from left to right.

This problem is generally defined as: given a database of trees T db and minimum

support threshold

v 2 }

times in T db .Most

commonly considered subtrees are induced and embedded. The formal defini-

tions of induced and embedded subtrees are as follows [ 42 ]: Given a tree S

, find all subtrees that occur at least

(

vs 0 ,

V S ,

L S ,

E S )

and tree T

= (

vt 0 ,

V T ,

L T ,

E T )

, S is an ordered induced subtree of

T iff (1) V S

E T ; (4) the left-to-

right ordering of sibling nodes in the original tree is preserved. Moreover, S is an

ordered embedded subtree of T iff (1) V S

ↆ

V T ;(2) L S

ↆ

L T , and L S (

) =

L T (

)

;(3) E S

ↆ

V T ;(2) L S

ↆ

L T , and L S (

) =

L T (

)

;

(3) if

v 1 in S and v 1 is ancestor of v 2 in T , and

(4) the left-to-right ordering of sibling nodes in the original tree is preserved. If

(

v 1 ,

v 2 ) ∈

E S then parent

(

v 2 ) =

= (

vs 0 ,

V S ,

L S ,

E S )

is an embedded subtree of T

= (

vt 0 ,

V T ,

L T ,

E T )

, and two

vertices v 1

V S form ancestor-descendant relationship, the level of

embedding (LoE) [ 42 ], between v 1 and v 2 , denoted by

∈

V S and v 2

∈

, is defined as the

length of the path between v 1 and v 2 in T . Hence, a maximum level of embedding

constraint (MaxLoE) M can be imposed on the subtrees extracted from T , such that

any two connected nodes in an embedded subtree of T will be connected in T by a

path that has the maximum length of M . Examples of induced and embedded sub-

tree are given in Fig. 10.1 (the number on the left of the nodes indicate its pre-order

position in the original tree T ).

(

v 1 ,

v 2 )

Feature Selection for Data and Pattern Recognition

Search WWH ::

Custom Search

Home