Table 10.10 CSLOGS flattened data characteristics and initial number of rules for varying support

Support         Attr. #      # Selected attr.   # of Rules with target attr.
threshold (%)   (DSM flat)   (Sym. Tau)         FullTree   Embedded   Induced
1               222          217                13835      13833      13809
5               64           52                 920        919        918
10              40           29                 216        215        215
20              24           11                 48         47         47
30              16           7                  32         31         31

mining algorithm [51], on which XRules is based, has difficulties in extracting
subtrees at the required low support thresholds.
10.5.2 Experiment Set 2—CSLOGS Data
The CSLogs data comprises the web access trees from the computer science
department of the Rensselaer Polytechnic Institute, previously used in [52] to
evaluate the XRules structural classifier. All three datasets (US1924, US2430, and
US304) were combined, and instances were replicated (in both training and test
data) to make the class distribution even. The tree instances are labelled according
to two classes, namely internal and external web site access. The total number of
combined instances is 68302. The training set comprised 66% of the data and the
remainder was left as the test set. Because different support thresholds were used,
in our approach the flat data representation of the dataset is generated separately
for each support threshold: the extracted database structure model (DSM) varies
with the threshold, and hence so does the number of attributes used during frequent
pattern generation. The general characteristics of the flat data format (including
backtrack attributes) and the initial number of rules extracted for the CSLogs data
(50% minimum confidence) at varying support thresholds are provided in
Table 10.10. Note that when association rules are used for a classification task,
performance naturally varies with the support threshold used. Hence, support
thresholds ranging from high to low were tried. As expected, larger support
thresholds come at the cost of limited coverage, because only the very frequent
subtrees are extracted to form part of the model.
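
The trade-off between support threshold, number of rules, and coverage can be illustrated with a minimal sketch. It is not the frequent subtree mining procedure used in this chapter: it enumerates rules by brute force over a small, made-up flat dataset. The records, the function names mine_rules and coverage, and the definition of rule support as the fraction of all records matching both the antecedent and the class are assumptions made purely for illustration.

```python
from itertools import combinations

# Hypothetical flat records (not CSLogs data): a set of attribute=value
# items plus a class label for each instance.
records = [
    ({"a=1", "b=1"}, "internal"),
    ({"a=1", "b=1"}, "internal"),
    ({"a=1", "b=0"}, "external"),
    ({"a=1", "c=1"}, "external"),
    ({"a=0", "c=1"}, "external"),
    ({"a=0", "b=1"}, "internal"),
]


def mine_rules(records, min_support, min_confidence=0.5, max_len=2):
    """Brute-force enumeration of rules 'antecedent -> class'.

    Assumed definitions:
      support    = records matching antecedent AND class / all records
      confidence = records matching antecedent AND class / records
                   matching the antecedent
    """
    n = len(records)
    items = sorted(set().union(*(its for its, _ in records)))
    classes = sorted({label for _, label in records})
    rules = []
    for k in range(1, max_len + 1):
        for antecedent in combinations(items, k):
            ante = set(antecedent)
            # Class labels of all records that contain the antecedent.
            covered = [label for its, label in records if ante <= its]
            if not covered:
                continue
            for cls in classes:
                hits = covered.count(cls)
                support = hits / n
                confidence = hits / len(covered)
                if support >= min_support and confidence >= min_confidence:
                    rules.append((antecedent, cls, support, confidence))
    return rules


def coverage(rules, records):
    """Fraction of records matched by at least one rule antecedent."""
    matched = sum(
        any(set(ante) <= its for ante, _, _, _ in rules) for its, _ in records
    )
    return matched / len(records)


if __name__ == "__main__":
    for threshold in (0.5, 0.3):
        found = mine_rules(records, min_support=threshold)
        print(threshold, len(found), coverage(found, records))
```

On this toy data, raising the minimum support from 0.3 to 0.5 reduces the rule set from five rules to one, and the fraction of records matched by any rule from all of them to half. This mirrors, on a very small scale, the pattern of shrinking rule counts in Table 10.10.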
For this dataset, the best results were achieved at the lowest examined support
threshold of 1%. Detailed results for the rules at this threshold, progressively
filtered by statistical analysis and redundancy removal, are presented in Table 10.11
(the performance of the final rule sets for all support thresholds is presented at the
end of this subsection). The number of rules is shown in brackets below each
reported AR and CR value. The results reveal that by selecting important input
attributes with ST and evaluating the rules with the statistical analysis and
redundancy assessment method,
 
 