Table 10.8 Structural characteristics of the data

Dataset                        |Tr|     |L|      Avg |T|  Avg |D|  Avg |F|  Max |T|  Max |D|  Max |F|
CRM                            1,181    10,611   52.97    4.89     8        533      5        46
CSLOGS                         68,302   16,207   7.8      3.45     1.82     313      123      137
Academic Institution Website   18,836   34,052   9.63     4.98     1.56     60       59       37

standard statistical techniques to reduce the huge number of rules typically generated
during frequent subtree mining, in the context of associative classification. As such,
the focus is on the use of basic accuracy and coverage rate rule evaluation measures
to observe the gradual difference in the rule set accuracy and coverage as different
feature/rule filtering techniques are applied.
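
As an illustration of the two measures, the following is a minimal Python sketch. This section does not define them, so the sketch assumes that the coverage rate is the fraction of instances matched by at least one rule, and that the accuracy rate is the fraction of covered instances for which the first matching rule predicts the correct class. The `Rule` type, the `evaluate` function and the toy data are hypothetical and are not taken from the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: frozenset  # items (attribute=value strings) that must all be present
    label: str             # class predicted when the antecedent matches


def evaluate(rules, instances):
    """Return (accuracy_rate, coverage_rate) for an ordered list of rules.

    instances is a list of (itemset, true_label) pairs.
    Assumed definitions (not from the source text):
      - an instance is covered if some rule's antecedent is a subset of its items;
      - a covered instance is correct if the first matching rule predicts its label.
    """
    covered = 0
    correct = 0
    for items, true_label in instances:
        match = next((r for r in rules if r.antecedent <= items), None)
        if match is None:
            continue  # no rule fires: the instance is not covered
        covered += 1
        correct += match.label == true_label
    coverage = covered / len(instances) if instances else 0.0
    accuracy = correct / covered if covered else 0.0
    return accuracy, coverage


# Toy data, purely illustrative.
rules = [
    Rule(frozenset({"X1=a", "X2=b"}), "short"),
    Rule(frozenset({"X3=c"}), "long"),
]
instances = [
    (frozenset({"X1=a", "X2=b", "X3=c"}), "short"),  # first rule fires, correct
    (frozenset({"X3=c"}), "short"),                  # second rule fires, wrong
    (frozenset({"X4=d"}), "long"),                   # no rule fires
    (frozenset({"X3=c", "X4=d"}), "long"),           # second rule fires, correct
]
accuracy, coverage = evaluate(rules, instances)
```

Re-running such a function after each filtering step is one way to observe how accuracy and coverage change as rules are removed.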
Each dataset was converted into a structure-preserving flat data format
(henceforth FDT) using the DSM approach. The backtrack attribute information was
kept in the DSM, as it is important for preserving the structural information and
allows the resulting rules to be represented as trees/subtrees. The backtrack
attributes can optionally be kept in the FDT as well; when present in rules, they
indicate the existence/non-existence of a node irrespective of its label, as
discussed in [16]. We compared the results when rules are generated from itemsets
with and without the backtrack attributes, and the difference was not substantial
enough to be worth reporting. Including the backtrack attributes typically gave
slightly better results in terms of rule set coverage rate, and thus all experiments
presented use this option. When reporting the results, the following notation is
used: ST (Symmetrical Tau), AR (accuracy rate), CR (coverage rate), FullTree (the
initial rule set containing disconnected subtree and backtrack attribute based
rules), Embedded (the rule set after itemsets have been mapped to the DSM, by
pre-order positions, to generate valid connected subtrees), and Induced (only
subtrees where the maximum level of embedding is limited to 1, i.e. parent-child
relationships among the nodes; see Sect. 10.3).
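
The ST (Symmetrical Tau) measure is only named in this section, not defined. As a hedged illustration, the sketch below assumes the commonly cited Zhou and Dillon form of the measure, computed from the contingency table of one attribute against the class. The function name and the toy data are hypothetical, and this is not necessarily how the measure was computed in the experiments.

```python
from collections import Counter


def symmetrical_tau(xs, ys):
    """Symmetrical Tau between two categorical sequences of equal length.

    Assumed formula, with P(ij) the joint and P(i+), P(+j) the marginal
    probabilities:

        tau = ( sum_ij P(ij)^2 / P(+j) + sum_ij P(ij)^2 / P(i+)
                - sum_i P(i+)^2 - sum_j P(+j)^2 )
              / ( 2 - sum_i P(i+)^2 - sum_j P(+j)^2 )
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("at least one observation is required")

    joint = Counter(zip(xs, ys))                       # counts for each (i, j) cell
    p_row = {i: c / n for i, c in Counter(xs).items()}  # P(i+)
    p_col = {j: c / n for j, c in Counter(ys).items()}  # P(+j)

    sum_row_sq = sum(p * p for p in p_row.values())
    sum_col_sq = sum(p * p for p in p_col.values())
    term_col = sum((c / n) ** 2 / p_col[j] for (i, j), c in joint.items())
    term_row = sum((c / n) ** 2 / p_row[i] for (i, j), c in joint.items())

    denom = 2 - sum_row_sq - sum_col_sq
    if denom == 0:
        return 0.0  # both variables are constant, so there is no association to measure
    return (term_col + term_row - sum_row_sq - sum_col_sq) / denom


# Toy data: one attribute that determines the class, one that is independent of it.
tau_perfect = symmetrical_tau(["a", "a", "b", "b"], ["s", "s", "l", "l"])
tau_independent = symmetrical_tau(["a", "a", "b", "b"], ["s", "l", "s", "l"])
```

Under this definition, attributes with a value close to 0 carry little information about the class and are natural candidates for removal in a feature filtering step.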
10.5.1 Experiment Set 1—CRM Data
CRM data is a real-world dataset relating to the handling of complaints in the area of
real estate. Each complaint relates to a particular defect in the property, and a
property manager assigns a case to each defect, containing information such as case
managers, contractors, areas of defect, district and building type. The classification
problem considered corresponds to “WorkCompletion”, which has 2 possible values
(within a month, and more than a month in duration). The attributes containing similar
information or referring to work/task completion duration were then removed.
The dataset consists of 1,181 instances with 675 attributes, of which 66% was used
 
 