Fig. 10.1 Example of induced/embedded subtrees (T1, T2, T4, T6) and embedded subtrees (T3, T5) of tree T
In this chapter, the focus is on evaluating rules based on embedded and induced subtrees that satisfy minimum support and confidence thresholds, and discarding any rules determined to be irrelevant to the classification task at hand. Let us denote the subtree patterns from the frequent subtree set SF that have a class label (value) as SFC, their accuracy as ac(SFC) and their coverage rate as cr(SFC). The problem focused on in this work can then be generally defined as follows: given SFC with accuracy ac(SFC) and coverage rate cr(SFC), obtain SFC' ⊆ SFC such that ac(SFC') ≥ (ac(SFC) − ε) and cr(SFC') ≥ (cr(SFC) − ε) (ε is an arbitrary user-defined small value used to reflect the noise that is often present in real-world data).
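For clarity, the selection criterion can be restated as a single condition (this compact restatement is ours; SFC' denotes the reduced rule set retained after filtering):

\[
  SFC' \subseteq SFC, \qquad
  ac(SFC') \ge ac(SFC) - \varepsilon, \qquad
  cr(SFC') \ge cr(SFC) - \varepsilon
\]

In other words, rules may be discarded as long as the accuracy and coverage of the remaining rule set do not drop by more than the tolerance ε.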
In what follows we discuss the common way of representing trees. This will lay the necessary ground for understanding the positional constraint imposed by the DSM approach [14]. A pre-order traversal can be computed as follows: if an ordered tree T consists only of a root node r, then r is the pre-order traversal of T. Otherwise, let T1, T2, ..., Tn be the subtrees occurring at r from left to right in T. The pre-order traversal begins by visiting r and then traversing all the remaining subtrees in pre-order, starting from T1 and finishing with Tn. The string encoding (φ) of a tree T = (v0, V, L, E) can be generated by adding vertex labels in the pre-order traversal of T and appending a backtrack symbol (e.g., '/', with '/' ∉ L) whenever we backtrack from a child node to its parent node. Figure 10.2 and Table 10.1 depict a tree database consisting of 7 tree instances (or transactions) and the string encoding for the tree database, respectively.
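To make the encoding concrete, the following minimal Python sketch produces such a string encoding for a small labelled ordered tree. The Node class, the function name string_encoding, and the example tree are illustrative assumptions, not the implementation used by the DSM approach.

    class Node:
        """A vertex of a labelled ordered tree."""
        def __init__(self, label, children=None):
            self.label = label
            self.children = children or []

    def string_encoding(root):
        # Emit the vertex label on first visit, recurse into the children
        # in left-to-right order, and emit the backtrack symbol '/' each
        # time traversal returns from a child to its parent.
        parts = [root.label]
        for child in root.children:
            parts.append(string_encoding(child))
            parts.append('/')
        return ' '.join(parts)

    # Example: the tree b(b, c(e)) is encoded as "b b / c e / /"
    t = Node('b', [Node('b'), Node('c', [Node('e')])])
    print(string_encoding(t))

Since every label opens a new node under the current one and every '/' closes the current node, the original ordered tree can be reconstructed unambiguously from the string; some formulations omit the trailing backtracks that return to the root, which does not affect this property.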
10.3.1 Feature Subset Selection
Feature subset selection is an important pre-processing step in the data mining process. If irrelevant attributes are left in the dataset, they can interfere with the data mining process and the quality of the discovered patterns may deteriorate, creating problems such as overfitting [9]. This is particularly the case for associative classifiers, since frequent patterns are typically generated without considering their predictive power [9], resulting in a huge feature space of possible frequent patterns. The removal of irrelevant attributes will result in a much smaller dataset, thereby
 