Information Technology Reference
In-Depth Information
10.3 Problem Background
The problem of finding association rules
x
y
was first introduced in [
1
]asa
data mining task for finding frequently co-occurring items in large databases. Let
I
ₒ
be a set of items. Let
D
be a transactions database for which
each record/transaction
R
is a set of items, such that
R
={
i
1
,
i
2
,...,
i
|
}
|
I
ↆ
I
. An association rule is
an implication of the form
x
ₒ
y
where
x
ↆ
I
and
y
ↆ
I
and
x
∩
y
=∅
.The
absolute support of a rule
x
y
is the number of transactions that contain both
x
and
y
. Typically, the relative support is used, where given the support of rule
x
ₒ
ₒ
y
(denoted as
y
)) be
s
%, there are
s
% of transactions in
D
that contain items
(itemsets)
x
and
y
. In other words, the probability
P
˃
(
x
ₒ
s
%. An itemset is
frequent if it satisfies the user-specified minimum support threshold. The confidence
of a rule
x
(
x
∪
y
)
=
y
, is the estimate of conditional probability of a transaction containing
the consequent (
y
) if the transaction contains the antecedent (
x
), and is calculated as
˃(
ₒ
.
Association rule discovery finds all rules that satisfy specific constraints such
as the minimum support and confidence threshold, as is the case in the Apriori
algorithm [
1
]. When tree-structured data such as XML is in question, the under-
lying associations are tree-structured by nature. Thus, the pre-requisite for the dis-
covery of (structural) association rules becomes the task of frequent subtree min-
ing. A tree-structured document can be modeled as a rooted ordered labeled tree.
A
rooted ordered labeled tree
can be denoted as
T
x
ₒ
y
)/˃ (
x
)
V
is the root vertex; (2)
V
is the set of vertices or nodes; (3)
L
is a labelling function
that assigns a label
L
=
(
v
0
,
V
,
L
,
E
)
, where (1)
V
0
∈
(
)
∈
={
(
v
1
,
v
2
)
|
v
1
,
v
2
∈
v
to every vertex
v
V
;(4)
E
V
AND
v
1
=
is the set of edges in the tree, and (5) for each internal nodes, the children
are ordered from left to right.
This problem is generally defined as: given a database of trees
T
db
and minimum
support threshold
v
2
}
times in
T
db
.Most
commonly considered subtrees are induced and embedded. The formal defini-
tions of induced and embedded subtrees are as follows [
42
]: Given a tree
S
˃
, find all subtrees that occur at least
˃
=
(
vs
0
,
V
S
,
L
S
,
E
S
)
and tree
T
=
(
vt
0
,
V
T
,
L
T
,
E
T
)
,
S
is an ordered
induced
subtree of
T
iff (1)
V
S
E
T
; (4) the left-to-
right ordering of sibling nodes in the original tree is preserved. Moreover,
S
is an
ordered
embedded
subtree of
T
iff (1)
V
S
ↆ
V
T
;(2)
L
S
ↆ
L
T
, and
L
S
(
v
)
=
L
T
(
v
)
;(3)
E
S
ↆ
ↆ
V
T
;(2)
L
S
ↆ
L
T
, and
L
S
(
v
)
=
L
T
(
v
)
;
(3) if
v
1
in
S
and
v
1
is ancestor of
v
2
in
T
, and
(4) the left-to-right ordering of sibling nodes in the original tree is preserved. If
S
(
v
1
,
v
2
)
∈
E
S
then
parent
(
v
2
)
=
=
(
vs
0
,
V
S
,
L
S
,
E
S
)
is an embedded subtree of
T
=
(
vt
0
,
V
T
,
L
T
,
E
T
)
, and two
vertices
v
1
V
S
form ancestor-descendant relationship, the
level of
embedding
(LoE) [
42
], between
v
1
and
v
2
, denoted by
∈
V
S
and
v
2
∈
, is defined as the
length of the path between
v
1
and
v
2
in
T
. Hence, a
maximum level of embedding
constraint
(MaxLoE)
M
can be imposed on the subtrees extracted from
T
, such that
any two connected nodes in an embedded subtree of
T
will be connected in
T
by a
path that has the maximum length of
M
. Examples of induced and embedded sub-
tree are given in Fig.
10.1
(the number on the left of the nodes indicate its pre-order
position in the original tree
T
).
(
v
1
,
v
2
)
Search WWH ::
Custom Search