unsupervised equal-length method with default parameter setting (10 bins). The
values for the contracting sum attribute were also discretised into 8 bins using the
same method. Figure 2(a) shows the result of the discretisation for the contracting
sum attribute.
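
As a minimal sketch of the equal-length (often called equal-width) discretisation described above, the following Python function splits the range of an attribute into a fixed number of bins of the same length. The function name `equal_width_bins` and the sample ages are illustrative assumptions, not taken from the project, and the tool the students used may treat bin boundaries differently.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal length.

    Returns a list of bin indices in the range 0 .. n_bins - 1.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    indices = []
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        indices.append(idx)
    return indices


# Hypothetical ages, for illustration only.
ages = [18, 22, 25, 31, 40, 47, 52, 68]
bins = equal_width_bins(ages, 10)
```

Because the bin boundaries depend only on the minimum and maximum values, a skewed attribute can leave several bins nearly empty, which is one reason the choice of method and bin count calls for justification.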
Data modelling/mining. One classification model using a decision tree method
was obtained. The tree has an overall accuracy of 75%. Figure 2(b) presents the
evaluation details of the tree. Eight quantitative association rules with support of 10%
and confidence of 91% were also discovered. No explanation was given about the
selection of the rules. Redundancy exists between two of the rules.
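
To make the support and confidence figures concrete, here is a small Python sketch of how the two measures are commonly computed for a rule. The records, attribute values and function names are hypothetical and do not come from the project's data. The last line also illustrates one common notion of redundancy, assumed here: a rule adds nothing if a more general rule (with fewer antecedent items) reaches the same confidence.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    if ante == 0:
        raise ValueError("antecedent never occurs")
    return both / ante


# Hypothetical discretised records, for illustration only.
records = [
    {"age=young", "sum=low", "rate=first"},
    {"age=young", "sum=low", "rate=first"},
    {"age=young", "sum=high", "rate=second"},
    {"age=old", "sum=low", "rate=first"},
    {"age=old", "sum=high", "rate=second"},
]

# Rule: age=young AND sum=low -> rate=first
rule_support = support(records, {"age=young", "sum=low", "rate=first"})
rule_confidence = confidence(records, {"age=young", "sum=low"}, {"rate=first"})

# More general rule: sum=low -> rate=first
general_confidence = confidence(records, {"sum=low"}, {"rate=first"})
```

In this toy data the more general rule is just as confident as the specific one, so reporting both would be redundant in the sense assumed here.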
Post-processing. Little attempt was made to evaluate the patterns. No
clear interpretation of the patterns was given.
The Assessment
The project does not outline any directions for the discovery. Understanding of data
characteristics is limited, which leads to the arbitrary decision to use an unsupervised
equal-length method for data discretisation. No reasons were given for choosing 10
and 8 bins to discretise the age and contracting sum attributes. A clear
sign of concern, as indicated by the circle in figure 2(a), was ignored. Some credit
should be given for the handling of the anomalies and the replacement of regional
codes. The project shows serious weaknesses in the data modelling/mining stage.
Only one trial of decision tree induction was attempted, without justification. The
purpose of the association rules is not clear. The weakest point of the whole project is
post-processing. Little attention was paid to the performance of the patterns. The
students did not realise that the tree is almost useless for classifying those who are paying
for the first rate (as indicated by the confusion matrix in figure 2(b)). The breakdown
of marks is as follows: 5 out of 20 for Data Understanding, 12 out of 25 for Data
Preparation and Pre-processing, 10 out of 25 for Modelling/Mining, and 4 out of 20 for
Post-processing. Because of the disorganised approach to work, only 3 out of 10
marks were given for project management. With a total mark of 34%, the project
is unsatisfactory.
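
The point made in the assessment about the confusion matrix can be illustrated with a short Python sketch. The counts below are invented for illustration and are not the values in figure 2(b); the class names `first` and `other` are likewise assumptions. The sketch shows how an overall accuracy of 75% can coexist with a very low recall for one class.

```python
def accuracy(matrix):
    """Overall accuracy: correctly classified records divided by all records.

    matrix[actual][predicted] holds record counts.
    """
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in matrix)
    return correct / total


def recall(matrix, label):
    """Fraction of records whose actual class is label that were predicted as label."""
    row_total = sum(matrix[label].values())
    return matrix[label][label] / row_total


# Hypothetical counts, NOT taken from figure 2(b).
confusion = {
    "first": {"first": 5, "other": 20},
    "other": {"first": 5, "other": 70},
}

overall = accuracy(confusion)              # 75 of 100 records are correct
first_recall = recall(confusion, "first")  # 5 of 25 'first' records are found
```

With these assumed counts the tree would look acceptable on accuracy alone while finding only a fifth of the `first` class, which is why per-class figures need to be inspected during post-processing.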
3.2 Project Two: The Good
The Data Set
This project uses a public domain data set about heart disease donated by the Cleveland
Clinic Foundation. The data set has 303 records and 14 attributes. The attributes
represent patient age, patient gender, and a range of clinical test results. The result
measurements include chest pain type, resting blood pressure, amount of cholesterol,
fasting blood sugar greater than 120, resting electrocardiographic result, maximum
heart rate, presence of exercise-induced angina, ST depression, slope of peak exercise
ST segment, number of major vessels coloured by fluoroscopy, thalassaemia, and
angiographic disease status. The number of data records is limited. Students are expected
to show good use of the limited data. Potentially interesting patterns would be
classification models regarding the presence of heart disease.