Table 10.14 Academic Institution flattened data characteristics and initial number of rules for varying support

Support         Attr. #      # Selected attr.   # of Rules with target attr.
threshold (%)   (DSM flat)   (Sym. Tau)         FullTree   Embedded   Induced
 1              442          217                -          -          -
 5              126          123                28282      28282      28282
10               70           63                234        234        234
20               36           29                50         49         49
30               26           19                14         13         13

to trees as was explained with the illustrative example in Sect. 3.1. The resulting
dataset had 18,836 instances, of which 66% were used for training and the remainder
for testing. Details of the WebLogs access setting can be found in [16].
The general characteristics of the flat data format (including backtrack attributes)
and the initial number of rules extracted for the education institution data (50% minimum
confidence) at varying support thresholds are provided in Table 10.14.
In this dataset, as in the experiments described in Sect. 10.5.2, rules from the
FullTree, Embedded and Induced rule sets were progressively assessed with
statistical analysis and redundancy assessment methods. The results demonstrate that
converting the original tree-structured data into the flat data format representation
created a very large number of input attributes, especially at lower support
thresholds. When the Apriori algorithm is used to generate all frequent rules,
analyzing every rule that satisfies the given support and confidence constraints
can become difficult.
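
To make the role of the support and confidence constraints concrete, the following is a minimal sketch of level-wise (Apriori-style) mining over flat records, keeping only rules whose consequent is the target attribute. It is an illustration under stated assumptions, not the implementation used in the experiments: the function name class_rules, the record layout (one dict per instance) and the toy attribute names are all hypothetical.

```python
from itertools import combinations


def class_rules(rows, target, min_support, min_confidence):
    """Return (antecedent, class_value, support, confidence) tuples.

    rows is a list of dicts mapping attribute name -> value (flat format).
    The name and layout are assumptions made for this sketch.
    """
    n = len(rows)
    # Each record becomes a set of (attribute, value) items.
    transactions = [frozenset(r.items()) for r in rows]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}

    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # An itemset cannot hold two values of the same attribute.
        candidates = {c for c in candidates
                      if len({attr for attr, _ in c}) == len(c)}
        # Prune step: every k-item subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, len(c) - 1))}
        level = {c for c in candidates if support(c) >= min_support}

    rules = []
    for itemset, sup in frequent.items():
        heads = [item for item in itemset if item[0] == target]
        if len(heads) != 1 or len(itemset) < 2:
            continue  # need exactly one class item and a non-empty antecedent
        antecedent = itemset - {heads[0]}
        confidence = sup / frequent[antecedent]
        if confidence >= min_confidence:
            rules.append((antecedent, heads[0][1], sup, confidence))
    return rules


# Toy records (hypothetical attribute names, not the WebLogs data).
toy_rows = [
    {"page": "home", "ref": "ext", "cls": "A"},
    {"page": "home", "ref": "int", "cls": "A"},
    {"page": "home", "ref": "ext", "cls": "B"},
    {"page": "faq", "ref": "ext", "cls": "B"},
]
toy_rules = class_rules(toy_rows, "cls", min_support=0.5, min_confidence=0.5)
```

Because every frequent itemset containing a class value yields a candidate rule, the rule count grows quickly as the support threshold is lowered, which is consistent with the counts reported in Table 10.14.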
As Table 10.15 shows, even with the given support constraint, the number of
extracted rules (Initial Rule Set) is large. Such a large volume of rules may
arise from the presence of irrelevant attributes in the dataset. The capability
of ST to select appropriate attributes, and thereby remove irrelevant ones, was
shown in our previous experiments on relational data problems. Similar experiments
were conducted for the present task of evaluating tree-structured rules. For each
support threshold, the attributes were ranked in decreasing order of ST value
and a relevance cut-off point was chosen.
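
The ranking step can be sketched as follows. The code assumes the commonly cited definition of Symmetrical Tau computed from the contingency table of an attribute against the class, with joint probabilities P(ij) and marginals P(i+), P(+j): tau = [sum_ij P(ij)^2/P(+j) + sum_ij P(ij)^2/P(i+) - sum_i P(i+)^2 - sum_j P(+j)^2] / [2 - sum_i P(i+)^2 - sum_j P(+j)^2]. The function names and the fixed numeric cut-off are hypothetical; the text does not state how the cut-off point was chosen.

```python
from collections import Counter


def symmetrical_tau(xs, ys):
    """Symmetrical Tau between attribute values xs and class labels ys.

    Uses the formula stated in the lead-in (an assumption of this sketch).
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts behind P(ij)
    row = Counter(xs)             # counts behind P(i+), attribute marginals
    col = Counter(ys)             # counts behind P(+j), class marginals
    sum_row_sq = sum((c / n) ** 2 for c in row.values())
    sum_col_sq = sum((c / n) ** 2 for c in col.values())
    denom = 2 - sum_row_sq - sum_col_sq
    if denom == 0:
        return 0.0  # attribute and class are both constant
    given_class = sum((c / n) ** 2 / (col[y] / n) for (x, y), c in joint.items())
    given_attr = sum((c / n) ** 2 / (row[x] / n) for (x, y), c in joint.items())
    return (given_class + given_attr - sum_row_sq - sum_col_sq) / denom


def select_attributes(rows, target, cutoff):
    """Rank attributes by decreasing ST and keep those at or above cutoff.

    cutoff is a hypothetical fixed threshold used only for illustration.
    """
    ys = [r[target] for r in rows]
    scores = {a: symmetrical_tau([r[a] for r in rows], ys)
              for a in rows[0] if a != target}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    kept = [a for a, score in ranked if score >= cutoff]
    return kept, ranked


# Toy records: "good" mirrors the class, "const" takes a single value.
st_rows = [
    {"good": "a", "const": "k", "cls": "x"},
    {"good": "a", "const": "k", "cls": "x"},
    {"good": "b", "const": "k", "cls": "y"},
    {"good": "b", "const": "k", "cls": "y"},
]
kept, ranked = select_attributes(st_rows, "cls", cutoff=0.5)
```

Under this definition an attribute that takes only one value scores zero, so it falls below any positive cut-off, while an attribute that determines the class exactly scores one.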
Table 10.15 shows, for each dataset and each support threshold, the number of
initial input attributes and the number of attributes remaining after applying
Symmetrical Tau (ST), with the respective number of rules (below). All attributes that
have been removed from the WebLogs data are backtrack attributes. This indicates
that including these backtrack nodes may not be useful, or that they have low
capability, for predicting the class attribute in this dataset. An input variable that
takes only a single value cannot distinguish between class values. Such input attributes
have been discarded, as they are considered irrelevant based on the calculated ST value.
With the ST feature selection technique applied, rules containing attributes that fail
the ST measure are discarded. The large number of rules was thereby reduced