Table 8.6 Data sets used for experiments

                             Number of
Data set                     Cases  Attributes  Concepts  Consistency (%)
Australian Credit Approval   690    14          2         100
Breast Cancer - Slovenia     286    9           2         95.45
Breast Cancer - Wisconsin    625    9           9         94.08
Bupa Liver Disorders         345    6           2         100
Glass                        214    9           6         100
Hepatitis                    155    19          2         100
Image segmentation           210    19          7         100
Iris                         150    4           3         100
Lymphography                 148    18          4         100
Pima                         768    8           2         100
Postoperative patients       90     8           3         84.44
Soybean                      307    35          19        100
Primary Tumor                339    17          21        76.40
Wine Recognition             178    13          3         100

Let us say that the value of attribute a is missing for case x from concept C.
This missing attribute value is replaced by the known value of a that has the
largest conditional probability given C, that is, the value of a occurring most
frequently among the cases of C.
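
The replacement rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the LERS implementation: the function name fill_missing_by_concept, the dictionary representation of cases, the "concept" key and the "?" marker for a missing value are all assumptions made for the example. The conditional probability is estimated by relative frequency within the concept, so the most probable value is simply the most frequent known one.

```python
from collections import Counter


def fill_missing_by_concept(cases, attribute, concept_key="concept", missing="?"):
    """Return copies of `cases` in which each missing value of `attribute`
    is replaced by the most frequent known value within the same concept.

    All names here are hypothetical; "?" as the missing marker is an assumption.
    """
    # Count the known values of the attribute separately for every concept.
    counts = {}
    for case in cases:
        value = case[attribute]
        if value != missing:
            counts.setdefault(case[concept_key], Counter())[value] += 1

    filled = []
    for case in cases:
        new_case = dict(case)  # leave the input untouched
        concept_counts = counts.get(case[concept_key])
        # If a concept has no known value at all, the gap is left as it is.
        if case[attribute] == missing and concept_counts:
            new_case[attribute] = concept_counts.most_common(1)[0][0]
        filled.append(new_case)
    return filled
```

When two values are equally frequent within a concept, this sketch keeps whichever was seen first; the text does not say how such ties should be broken.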
Some of these data sets had numerical attributes (Australian Credit Approval,
Bupa Liver Disorders, Primary Tumor and Wine Recognition). Numerical attributes
were discretized using cluster analysis methods of discretization [5].
The data mining system LERS provides a number of discretization
algorithms [13]. In our experiments we used two approaches to discretization based
on cluster analysis. First, all numerical attributes were normalized [7] (attribute
values were divided by the attribute standard deviation).
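
As a small illustration of this normalization step, the following sketch divides every value of one numerical attribute by that attribute's standard deviation. The function name normalize_by_std is hypothetical, and the use of the sample standard deviation (statistics.stdev) is an assumption, since the text does not say whether the sample or the population form was used.

```python
import statistics


def normalize_by_std(values):
    """Divide each value of a numerical attribute by the attribute's
    standard deviation (sample standard deviation assumed here).

    Needs at least two values, because statistics.stdev does.
    """
    sd = statistics.stdev(values)
    if sd == 0:
        # A constant attribute cannot be rescaled; return it unchanged.
        return list(values)
    return [v / sd for v in values]
```

After this step every numerical attribute has a standard deviation of one, so no single attribute dominates the distances computed during clustering.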
In our first discretization technique, based on agglomerative cluster analysis [7],
each case initially forms a single cluster; clusters are then fused together, forming
larger and larger clusters. In the remaining four cluster analysis discretization
methods, which use divisive techniques, all cases are initially grouped in one
cluster, which is then gradually divided into smaller and smaller clusters. With both
kinds of methods, during the first step of discretization, cluster formation, cases
that exhibit the most similarity are fused into clusters.
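
The agglomerative cluster formation step can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in LERS: the Euclidean distance between cluster centroids as the similarity measure, the stopping rule (a target number of clusters k) and the function names centroid and agglomerate are all choices made for the example. A divisive method would run in the opposite direction, starting from one cluster and splitting it, and is not shown.

```python
import math


def centroid(cluster):
    """Component-wise mean of the cases (numeric tuples) in a cluster."""
    n = len(cluster)
    return [sum(case[i] for case in cluster) / n for i in range(len(cluster[0]))]


def agglomerate(cases, k):
    """Fuse the two closest clusters repeatedly until only k clusters remain.

    Centroid distance and the stopping rule k are assumptions of this sketch.
    """
    # Initially each case is a single cluster.
    clusters = [[tuple(case)] for case in cases]
    while len(clusters) > k:
        best = None
        # Find the pair of clusters whose centroids are closest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Fuse cluster j into cluster i; j > i, so index i stays valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The search over all pairs is repeated after every fusion, which is fine for data sets of the sizes in Table 8.6 but would be slow on much larger ones.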
 