The flat data format (relational or vectorial data) has proven acceptable and successful with many well-established data mining techniques. Hence, an effective approach proposed in [14], known as the Database Structure Model (DSM), is utilized in this research to represent tree-structured data in a structure-preserving flat data format. This approach preserves both the tree structure and the attribute-value information. With the application of DSM, the structural characteristics are preserved during the data mining process, and the rules extracted by the data mining application can be mapped back onto the DSM to regenerate the pre-order string encodings of subtrees.
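To make the pre-order string encoding concrete, the following is a minimal sketch of a depth-first serialization of a labelled tree. The "-1" backtrack symbol and the (label, children) tree representation are illustrative assumptions following the common string encoding used in tree mining, not necessarily the exact conventions of DSM in [14].

```python
# Minimal sketch: pre-order string encoding of a labelled tree.
# The "-1" backtrack symbol and the (label, children) representation
# are illustrative assumptions, not the exact DSM conventions of [14].

def preorder_encoding(tree):
    """Encode a tree given as (label, [children]) into a pre-order string.

    Each node label is emitted when the node is first visited, and the
    backtrack symbol "-1" is emitted when returning to the parent.
    """
    label, children = tree
    parts = [str(label)]
    for child in children:
        parts.extend(preorder_encoding(child))
        parts.append("-1")
    return parts

# Example tree:   a
#                / \
#               b   c
#               |
#               d
tree = ("a", [("b", [("d", [])]), ("c", [])])
print(" ".join(preorder_encoding(tree)))  # a b d -1 -1 c -1
```

Because the encoding records both labels and backtracks, the original subtree shape can be regenerated from the string, which is the property DSM relies on when mapping extracted rules back to subtrees.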
Let a tree-structured dataset in flat table format (FDT) be denoted as D, I = {i_1, i_2, ..., i_|I|} the set of distinct items in D, AT = {at_1, at_2, ..., at_|AT|} the set of input attributes in D, and Y = {y_1, y_2, ..., y_|Y|} the class attribute with a set of class labels in D. Assume that D contains a set of n records, D = {(x_r, y_r)}, r = 1, ..., n, where x_r ⊆ I is an item or a set of items and y_r ∈ Y is a class label; then |x_r| = |AT| and x_r = {at_1 val_r, at_2 val_r, ..., at_|AT| val_r} contains the attribute names and corresponding values for record r in D, for each attribute at in AT. The training dataset is denoted as D_tr ⊆ D, the testing dataset as D_ts ⊆ D, and the filtered database after feature selection as D'_tr, where I' ⊆ I.
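The notation above can be illustrated with a tiny concrete dataset; the attribute names, values, and class labels below are hypothetical, chosen only to show the structure of D, I, AT, and Y.

```python
# Tiny concrete instance of the notation above.  Attribute names,
# values, and class labels are hypothetical.  Each record r is a pair
# (x_r, y_r): x_r holds one value per attribute in AT, y_r is a label.
AT = ["at1", "at2", "at3"]   # input attributes
Y = {"yes", "no"}            # class labels

D = [
    ({"at1": "a", "at2": "b", "at3": "c"}, "yes"),
    ({"at1": "a", "at2": "d", "at3": "c"}, "no"),
    ({"at1": "e", "at2": "b", "at3": "f"}, "yes"),
]

# |x_r| = |AT| holds for every record r.
assert all(len(x_r) == len(AT) for x_r, _ in D)

# The set of distinct items I: every (attribute, value) pair seen in D.
I = {(at, x_r[at]) for x_r, _ in D for at in AT}

# A simple split into training and testing subsets, D_tr ⊆ D, D_ts ⊆ D.
D_tr, D_ts = D[:2], D[2:]
```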
The rule sets extracted from the flat table format (FDT) satisfying the minimum support and confidence thresholds are denoted as F(A). Individual rules are denoted as f_A ∈ F(A), of the form x → y, where x is the antecedent and y the consequent, such that ∃{x_r, y_r} ∈ D_tr with x ⊆ x_r, x_r = {at_1 val_r, at_2 val_r, ..., at_|AT| val_r}, and y ∈ Y is a class label. For generating F(A), SAS Enterprise Miner software was used.
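The chapter generated F(A) with SAS Enterprise Miner; purely to illustrate how the support and confidence thresholds filter class association rules of the form x → y over D_tr, here is a minimal sketch with hypothetical data (the function name and the brute-force candidate enumeration are assumptions, not the tool's algorithm).

```python
from itertools import combinations

# Minimal sketch of generating F(A): class association rules x -> y
# over a training set D_tr, filtered by minimum support and confidence.
# The chapter used SAS Enterprise Miner; this brute-force enumeration
# only illustrates the thresholding, with hypothetical data.

def class_rules(D_tr, min_sup, min_conf, max_len=2):
    n = len(D_tr)
    rules = []
    for size in range(1, max_len + 1):
        # Candidate antecedents: item sets drawn from observed records.
        candidates = {frozenset(c)
                      for x_r, _ in D_tr
                      for c in combinations(sorted(x_r.items()), size)}
        for x in candidates:
            # Records whose attribute-value pairs contain the antecedent.
            cover = [y_r for x_r, y_r in D_tr if x <= set(x_r.items())]
            sup = len(cover) / n
            if sup < min_sup:
                continue
            for y in set(cover):
                conf = cover.count(y) / len(cover)
                if conf >= min_conf:
                    rules.append((dict(x), y, sup, conf))
    return rules

D_tr = [
    ({"at1": "a", "at2": "b"}, "yes"),
    ({"at1": "a", "at2": "c"}, "yes"),
    ({"at1": "d", "at2": "b"}, "no"),
]
for x, y, sup, conf in class_rules(D_tr, min_sup=0.5, min_conf=0.9):
    print(x, "->", y, f"sup={sup:.2f} conf={conf:.2f}")
```

On this toy data only {at1: a} → yes survives both thresholds: it covers two of three records (support 0.67) and every covered record carries the label "yes" (confidence 1.0).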
Feature Subset Selection: The Symmetrical Tau (ST) measure [54] was derived from Goodman and Kruskal's Asymmetrical Tau measure of association for cross-classification tasks in the statistical domain. Zhou and Dillon [54] used the Asymmetrical Tau measure for feature selection during decision tree building, and found that it tends to favour attributes with more values. When the classes of an attribute A are increased by class subdivision, more is known about attribute A, and the probability of error in predicting the class of another attribute B may decrease. On the other hand, attribute A becomes more complex, potentially causing an increase in the probability of error in predicting its category from the category of B. This trade-off effect inspired Zhou and Dillon [54] to combine the two asymmetrical measures in order to obtain a balanced feature selection criterion which is, in turn, symmetrical. Note, however, that for Boolean variables the symmetrical and asymmetrical tau have the same value. Some powerful properties of ST, as reported in [54], are: noise handling through built-in statistical strength; conveying potential classification uncertainties through dynamic error estimation; no bias towards multi-valued attributes; no proportionality to sample size; a proportional-reduction-in-error nature that allows measuring sequential variation in predictive capability; and handling of Boolean combinations of logical features.
Let there be R rows and C columns in the contingency table for attributes at_i and Y. The probability that an individual belongs to row category r and column category c is represented as P(rc), and P(r+) and P(+c) are the marginal probabilities in row category r and column category c, respectively. The measure is then defined in terms of these joint and marginal probabilities.