Information Technology Reference
In-Depth Information
10.3 Problem Background
The problem of finding association rules x
y was first introduced in [ 1 ]asa
data mining task for finding frequently co-occurring items in large databases. Let
I
be a set of items. Let D be a transactions database for which
each record/transaction R is a set of items, such that R
={
i 1 ,
i 2 ,...,
i
| }
|
I
I . An association rule is
an implication of the form x
y where x
I and y
I and x
y
=∅
.The
absolute support of a rule x
y is the number of transactions that contain both x
and y . Typically, the relative support is used, where given the support of rule x
y
(denoted as
y )) be s %, there are s % of transactions in D that contain items
(itemsets) x and y . In other words, the probability P
˃
( x
s %. An itemset is
frequent if it satisfies the user-specified minimum support threshold. The confidence
of a rule x
(
x
y
) =
y , is the estimate of conditional probability of a transaction containing
the consequent ( y ) if the transaction contains the antecedent ( x ), and is calculated as
˃(
.
Association rule discovery finds all rules that satisfy specific constraints such
as the minimum support and confidence threshold, as is the case in the Apriori
algorithm [ 1 ]. When tree-structured data such as XML is in question, the under-
lying associations are tree-structured by nature. Thus, the pre-requisite for the dis-
covery of (structural) association rules becomes the task of frequent subtree min-
ing. A tree-structured document can be modeled as a rooted ordered labeled tree.
A rooted ordered labeled tree can be denoted as T
x
y
)/˃ (
x
)
V
is the root vertex; (2) V is the set of vertices or nodes; (3) L is a labelling function
that assigns a label L
= (
v 0 ,
V
,
L
,
E
)
, where (1) V 0
(
)
={ (
v 1 ,
v 2 ) |
v 1 ,
v 2
v
to every vertex v
V ;(4) E
V AND
v 1 =
is the set of edges in the tree, and (5) for each internal nodes, the children
are ordered from left to right.
This problem is generally defined as: given a database of trees T db and minimum
support threshold
v 2 }
times in T db .Most
commonly considered subtrees are induced and embedded. The formal defini-
tions of induced and embedded subtrees are as follows [ 42 ]: Given a tree S
˃
, find all subtrees that occur at least
˃
=
(
vs 0 ,
V S ,
L S ,
E S )
and tree T
= (
vt 0 ,
V T ,
L T ,
E T )
, S is an ordered induced subtree of
T iff (1) V S
E T ; (4) the left-to-
right ordering of sibling nodes in the original tree is preserved. Moreover, S is an
ordered embedded subtree of T iff (1) V S
V T ;(2) L S
L T , and L S (
v
) =
L T (
v
)
;(3) E S
V T ;(2) L S
L T , and L S (
v
) =
L T (
v
)
;
(3) if
v 1 in S and v 1 is ancestor of v 2 in T , and
(4) the left-to-right ordering of sibling nodes in the original tree is preserved. If
S
(
v 1 ,
v 2 )
E S then parent
(
v 2 ) =
= (
vs 0 ,
V S ,
L S ,
E S )
is an embedded subtree of T
= (
vt 0 ,
V T ,
L T ,
E T )
, and two
vertices v 1
V S form ancestor-descendant relationship, the level of
embedding (LoE) [ 42 ], between v 1 and v 2 , denoted by
V S and v 2
, is defined as the
length of the path between v 1 and v 2 in T . Hence, a maximum level of embedding
constraint (MaxLoE) M can be imposed on the subtrees extracted from T , such that
any two connected nodes in an embedded subtree of T will be connected in T by a
path that has the maximum length of M . Examples of induced and embedded sub-
tree are given in Fig. 10.1 (the number on the left of the nodes indicate its pre-order
position in the original tree T ).
(
v 1 ,
v 2 )
 
Search WWH ::




Custom Search