$$\beta_i\, at_i = \sum_{r=1}^{n} \left\{ y_r \ln\left[\pi(at_i\, val_r)\right] + (1 - y_r) \ln\left[1 - \pi(at_i\, val_r)\right] \right\} \tag{10.3}$$
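As a rough illustration of the sum in Eq. 10.3, the sketch below evaluates the log-likelihood for a single numeric attribute. It assumes that π is the standard logistic function of a linear predictor with an intercept; the names sigmoid, log_likelihood, beta0 and beta1 are illustrative and do not come from the text.

```python
import math


def sigmoid(z):
    """Logistic function, assumed here to play the role of pi in Eq. 10.3."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta0, beta1, values, labels):
    """Sum over records r of y_r*ln(pi_r) + (1 - y_r)*ln(1 - pi_r).

    values: attribute value of each record (val_r)
    labels: binary class label of each record (y_r, 0 or 1)
    beta0, beta1: hypothetical intercept and coefficient (assumption)
    """
    total = 0.0
    for val, y in zip(values, labels):
        p = sigmoid(beta0 + beta1 * val)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# A coefficient that separates the classes yields a higher log-likelihood
# than the null model with both parameters set to zero.
values = [0, 0, 1, 1]
labels = [0, 0, 1, 1]
null_fit = log_likelihood(0.0, 0.0, values, labels)
better_fit = log_likelihood(-2.0, 4.0, values, labels)
```

Comparing the fitted log-likelihood against that of the null model is the usual basis for a significance test of the coefficient. The sketch does not guard against probabilities that round to exactly 0 or 1.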
The statistical hypothesis is then used to determine whether the input attributes
are significantly related to the class attribute. A number of models can be developed
from logistic regression analysis, and each produces a different selection of attributes.
The model that fits the data well and has the highest predictive capability is selected.
Hence, logistic regression is used to discard any $fA_k \in F(A)$, $fB_k \in F(B)$,
$fC_k \in F(C)$ for which, for $at_i$ contained in $x$ of $x \rightarrow y$, the
$\beta_i\, at_i$ value is not significant towards the class attribute $Y$
(logistic regression analysis in Eq. 10.3).
Redundant and Contradictive Rule Removal: To remove redundant rules, we utilize
the concept of productive rules [4]. This approach is based on the minimum
improvement redundant rule constraint [4], which discards any rule $x \rightarrow y$ if
$\mathrm{confidence}(x \rightarrow y) \le \max_{z \subset x}(\mathrm{confidence}(z \rightarrow y))$.
In other words, a rule $x \rightarrow y$ with confidence value $c_1$ is considered
redundant if there exists another rule $z \rightarrow y$ with confidence value $c_2$,
where $z \subset x$ and $c_1 \le c_2$. The contradictory rule constraint [53] is then
utilised to discard two or more rules that have the same precedent but
imply a different class value.
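The two filters described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: each rule is represented as a hypothetical (antecedent, class, confidence) triple whose antecedent is a frozenset of items, whereas the rules in this chapter are derived from subtree patterns.

```python
def remove_redundant(rules):
    """Drop x -> y if some z -> y with z a proper subset of x has
    confidence at least as high (minimum improvement constraint)."""
    kept = []
    for x, y, c1 in rules:
        redundant = any(
            z < x and y_z == y and c1 <= c2  # z < x: proper subset test
            for z, y_z, c2 in rules
        )
        if not redundant:
            kept.append((x, y, c1))
    return kept


def remove_contradictory(rules):
    """Drop every rule whose antecedent also appears with another class."""
    classes_by_antecedent = {}
    for x, y, _ in rules:
        classes_by_antecedent.setdefault(x, set()).add(y)
    return [r for r in rules if len(classes_by_antecedent[r[0]]) == 1]


# Hypothetical rule set used only to exercise the two filters.
rules = [
    (frozenset({"a"}), "yes", 0.80),
    (frozenset({"a", "b"}), "yes", 0.75),  # redundant: {"a"} does better
    (frozenset({"a", "c"}), "yes", 0.90),  # productive: improves on {"a"}
    (frozenset({"d"}), "yes", 0.60),
    (frozenset({"d"}), "no", 0.40),        # contradicts the rule above
]
productive = remove_redundant(rules)
final_rules = remove_contradictory(productive)
```

The redundancy check compares every pair of rules, which is adequate for a small example; a large rule set would need an index over antecedents.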
Rules Accuracy and Rules Coverage: A measure needs to be applied to verify
whether the removal of a large volume of rules, based on statistical analysis and
on the redundancy and contradictory assessment methods, still enables the discovery
of all the interesting and significant subtree patterns. As such, the quality of the
subtree patterns will be demonstrated based on their accuracy and coverage values.
The values for rule accuracy and coverage will be measured at every stage and
sequence of this task. This measure is crucial as it determines the quality of the
discovered rules. Additionally, this analysis will reveal the balancing/optimization
issues with regard to the trade-off between accuracy rate and coverage rate.
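To make the trade-off concrete, the sketch below computes both rates for a rule set. The definitions are assumptions, since they are not spelled out at this point in the text: coverage is taken as the fraction of instances matched by at least one rule, and accuracy as the fraction of covered instances whose class agrees with the highest-confidence matching rule. The function name evaluate and the (antecedent, class, confidence) rule representation are illustrative.

```python
def evaluate(rules, instances):
    """Return (accuracy, coverage) of a rule set on labelled instances.

    rules: list of (antecedent frozenset, class, confidence) triples
    instances: list of (item frozenset, true class) pairs
    """
    covered = 0
    correct = 0
    for items, label in instances:
        # A rule matches when its antecedent is contained in the instance.
        matching = [r for r in rules if r[0] <= items]
        if not matching:
            continue
        covered += 1
        best = max(matching, key=lambda r: r[2])  # highest confidence wins
        if best[1] == label:
            correct += 1
    coverage = covered / len(instances) if instances else 0.0
    accuracy = correct / covered if covered else 0.0
    return accuracy, coverage


# Hypothetical rules and instances, for illustration only.
rules = [
    (frozenset({"a"}), "yes", 0.8),
    (frozenset({"b"}), "no", 0.7),
]
instances = [
    (frozenset({"a", "c"}), "yes"),
    (frozenset({"b"}), "yes"),
    (frozenset({"c"}), "no"),
    (frozenset({"a", "b"}), "yes"),
]
accuracy, coverage = evaluate(rules, instances)
```

Under these definitions, pruning rules can raise accuracy while lowering coverage, because instances that were matched only by discarded rules are no longer covered.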
10.5 Experimental Evaluation
In this section we present the experiments performed using the CRM dataset
(real estate property management records in XML), the CSLOGS dataset (web access
trees) and an academic institution dataset (web access trees), the structural
characteristics of which are shown in Table 10.8. The following notation is used:
|Tr|: number of transactions (independent tree instances); |L|: number of unique
labels; |T|: number of nodes (size) in a transaction; |D|: depth; |F|: fan-out
factor (or degree). Please note that in [52], where structural/XML classification
was first proposed, it was demonstrated that a simpler classifier that does not take
the structure into account cannot achieve equally good results. Similarly, in [51] it
was empirically shown that tree-structured web-browsing patterns are more informative
and useful than their itemset/sequential pattern counterparts. Hence, this study is not
repeated in this work, but rather an experimental study is presented on the use of