Table 10.9 Subtree association rule evaluation for CRM data

                                       FullTree                     Induced
Type of analysis      Data partition   # of Rules  AR%    CR%       # of Rules  AR%    CR%
# of Rules after ST   Training         27116       83.02  100       5270        81.56  100
                      Testing                      83.74  100                   83.4   100
Logistic regression   Training         91          79.85  100       17          68.54  100
                      Testing                      80.95  100                   70.57  100
Redundancy removal    Training         51          76.78  100       17          68.54  100
                      Testing                      77.72  100                   70.57  100
Min. Conf. 60%        Training         44          83.82  95.50     12          77.20  91.53
                      Testing                      84.57  96.15                 79.18  93.59
for training and 34% for testing. However, there are many complex classes within
this CRM data that may interest the users of the data. Nevertheless, since our main
purpose is not to analyse the CRM problem itself but to use the CRM data as an
example of tree-structured data, attention is confined to the aforementioned class.
The resulting DSM-based flat data format contains 675 attributes (including the
class), of which 586 were selected using Symmetrical Tau (ST) feature selection.
The rules were then generated with a minimum support of 5% and a minimum
confidence of 50%. Note that initially the dataset with backtrack attributes was used,
which caused memory issues in the SAS software; hence, ST feature selection was
applied prior to generating association rules, which removed all of the backtrack
attributes in this dataset. Furthermore, for this dataset all generated subtrees are of
the induced type, and hence we do not report any results for the embedded subtree
variation, as it is identical to the induced one for this data.
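As a minimal sketch of the ST feature selection step above, the Symmetrical Tau measure can be computed from an attribute-value by class contingency table (this NumPy implementation, based on the symmetrised Goodman-Kruskal tau, and the toy table are illustrative assumptions, not the chapter's code):

```python
import numpy as np

def symmetrical_tau(contingency):
    """Symmetrical Tau association between an attribute and the class.

    `contingency` is an attribute-value x class count matrix.
    Symmetrises the Goodman-Kruskal tau over rows and columns;
    values close to 1 indicate a strongly predictive attribute.
    """
    P = np.asarray(contingency, dtype=float)
    P = P / P.sum()                          # joint probabilities P(i, j)
    row = P.sum(axis=1)                      # marginals P(i+)
    col = P.sum(axis=0)                      # marginals P(+j)
    num = ((P ** 2 / col).sum()              # sum_ij P(ij)^2 / P(+j)
           + (P ** 2 / row[:, None]).sum()   # sum_ij P(ij)^2 / P(i+)
           - (row ** 2).sum() - (col ** 2).sum())
    den = 2.0 - (row ** 2).sum() - (col ** 2).sum()
    return num / den

# A 2-valued attribute that perfectly separates a 2-class label
print(symmetrical_tau([[50, 0], [0, 50]]))   # → 1.0
```

Features are then ranked by their ST value and the lowest-ranked ones (here, the backtrack attributes) are discarded before rule generation.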
Table 10.9 shows the results as the statistical analysis and the redundancy assess-
ment were progressively applied to evaluate the interestingness of the rules. Note
that the chi-square analysis is not presented, as it did not result in any rule removal
at that stage, and all of the connected subtrees in this dataset were of the induced
subtree type. As one can see, a significant number of rules was removed by applying
the logistic regression analysis, and in the FullTree rule set a further 40 rules were
detected as redundant. This reduced the AR% by about 3%, but after rules whose
confidence is below 60% were removed (last row) the accuracy increased, at the
cost of not covering around 5% of the instances. In this experiment the FullTree
rule set is the optimal one, as it is not only more accurate in classifying/predicting
specific instances in the database, but in the final step also achieves a higher coverage
rate than the Induced rule set. The FullTree rule set can contain rules that do not
convert to valid (connected) subtrees when matched to the DSM. Nevertheless, these
are important to include, as they may represent important associations that should
not be lost simply because they do not convert to connected valid subtrees. Note
that we have tried to run the XRules structural classifier [52] on this data, but since
there are quite a few repeating node labels in single tree instances, caused by
repetition of defects and individual cases within a single record, the tree
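The minimum-confidence pruning and evaluation behind the last row of Table 10.9 can be sketched as follows. The rule representation and the exact AR%/CR% definitions are assumptions made for illustration: CR% is taken as the share of instances matched by at least one rule, and AR% as the share of covered instances whose highest-confidence matching rule predicts the true class.

```python
def filter_and_evaluate(rules, instances, min_conf=0.60):
    """Drop rules below min_conf, then measure coverage and accuracy.

    Each rule is (antecedent_items: set, predicted_class, confidence);
    each instance is (items: set, true_class).
    Returns (kept_rules, AR%, CR%).
    """
    kept = [r for r in rules if r[2] >= min_conf]
    covered = correct = 0
    for items, true_class in instances:
        matches = [r for r in kept if r[0] <= items]  # antecedent subset match
        if not matches:
            continue                                  # instance left uncovered
        covered += 1
        best = max(matches, key=lambda r: r[2])       # highest confidence wins
        if best[1] == true_class:
            correct += 1
    n = len(instances)
    cr = 100.0 * covered / n if n else 0.0
    ar = 100.0 * correct / covered if covered else 0.0
    return kept, ar, cr
```

Raising the confidence threshold removes weak rules, which is why the table shows AR% rising while CR% drops below 100: some instances are no longer matched by any surviving rule.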
 