The Student Work
The project was undertaken by a group of three students in 2007. Details of the work
are listed as follows:
Data understanding. From the start, the group outlined a clear business objective,
i.e. finding patterns relating to the presence or absence of heart disease. The
group collected and formatted the data, explored domain types and values, obtained
descriptive statistics for the numerical attributes, and assessed data quality.
Separate reports for each of these purposes were also produced.
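As a rough illustration of the descriptive statistics and data quality checks mentioned above, the following minimal sketch summarises one numeric attribute and counts its missing entries. The function name `describe`, the use of `None` to mark a missing entry, and the sample values are assumptions made for illustration; they are not taken from the students' reports.

```python
import statistics

def describe(values):
    """Summarise one numeric attribute; None marks a missing entry (assumption)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),                     # number of usable values
        "missing": len(values) - len(present),     # simple data quality indicator
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),        # sample standard deviation
        "min": min(present),
        "max": max(present),
    }

# Hypothetical column of patient ages with one missing entry.
ages = [63, 41, None, 57, 62]
summary = describe(ages)
print(summary)
```

A summary of this kind applies only to numeric attributes; nominal and ordinal attributes would need frequency counts instead.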
Data preparation and pre-processing. The group focused on data cleaning: removing
outliers and filling missing values with sensible alternatives. For both purposes,
the ordinal and nominal values were first converted into discrete integers. To deal
with missing values, the group found each incomplete record's nearest neighbour and
used the neighbour's attribute value to fill the missing field. To deal with
outliers, the students first plotted the data records as points in a scatter plot
and located the anomalous values manually. An anomalous value was treated as a
data-entry error and was therefore also replaced by the value of its nearest
neighbour. After the data cleaning operations, the discrete integers for the ordinal
and nominal attributes were converted back to the original labels.
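The nearest-neighbour filling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the group's actual procedure: the source does not say which distance measure or tool was used, so the sketch assumes Euclidean distance over the fields that are present, considers only complete records as candidate neighbours, and uses `None` for a missing field. The function name and the sample records are hypothetical.

```python
import math

def nearest_neighbour_fill(records, target_index):
    """Return a copy of records[target_index] with missing fields (None)
    taken from the nearest complete record."""
    target = records[target_index]
    known = [i for i, v in enumerate(target) if v is not None]
    best, best_dist = None, math.inf
    for j, other in enumerate(records):
        if j == target_index:
            continue
        # Assumption: only complete records may act as the neighbour.
        if any(v is None for v in other):
            continue
        # Euclidean distance over the fields the target actually has.
        # Attributes are not rescaled here, so wide-ranging fields dominate.
        dist = math.sqrt(sum((target[i] - other[i]) ** 2 for i in known))
        if dist < best_dist:
            best, best_dist = other, dist
    if best is None:
        return list(target)  # no complete neighbour available; leave gaps as-is
    return [best[i] if v is None else v for i, v in enumerate(target)]

# Hypothetical records: [age, chest pain type (integer code), cholesterol].
records = [
    [63, 1, 233],
    [41, 2, 204],
    [62, None, 230],   # chest pain type is missing
    [57, 4, 354],
]
filled = nearest_neighbour_fill(records, 2)
print(filled)
```

The same routine could be reused for an outlier by first setting the anomalous field to `None`, which mirrors the replacement of wrongly entered values by the neighbour's value.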
Data modelling/mining. The group conducted two main data mining tasks: building a
model to classify whether a patient is healthy or has the disease, and profiling
patients in both classes via clustering. For the first task, the group used the J4.8
decision tree method with different parameter settings and a 2/3-1/3 split of
training and testing examples as the test option. A number of candidate trees with
overall accuracy rates from 72% to 79% were obtained. The students found that
pruning improved tree accuracy. Figure 3 shows the performance summary of one of the
candidate trees. To consolidate the finding, a similar classification task was also
attempted using the tree induction method of another tool (RDS). Some similarities
in the resulting trees were found. For the second task, the k-means method was used;
the group tried different values of k in search of good cluster quality and
eventually settled on k = 4.
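Trying several values of k and comparing cluster quality can be sketched as below. The source does not state which quality measure the group used, so this sketch assumes the within-cluster sum of squared distances (SSE), a common choice. The function names, the deterministic initialisation from the first k points, and the toy data are all assumptions for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100):
    """Plain k-means; returns (assignment, centroids, SSE)."""
    # Assumption: take the first k points as initial centroids so that
    # the run is reproducible; real tools usually initialise randomly.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, sse

# Toy data (hypothetical): four well-separated pairs of points.
points = [(0, 0), (10, 0), (0, 10), (10, 10), (0, 1), (10, 1), (0, 11), (10, 11)]
for k in range(1, 6):
    _, _, sse = kmeans(points, k)
    print(k, round(sse, 2))
```

SSE always tends to fall as k grows, so the usual reading is to look for the value of k after which the improvement levels off rather than for the smallest SSE.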
Post-processing. Both the decision trees and the clusters were evaluated as
different parameter settings for the tree induction and different values of k were
tried. The group also converted the trees into rules to assist their understanding.
According to the external medical experts the group consulted, some of the rules
made good medical sense and supported recommendations on how certain people might
avoid heart disease. The interpretation of the clustering results was attempted via
cluster summaries and by visualising the membership of the clusters.
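Converting a decision tree into rules amounts to writing out one rule per path from the root to a leaf. The sketch below illustrates the idea on a hand-made toy tree. The nested-dictionary representation, the function names, and the attribute names and labels are hypothetical and do not reproduce any tree or rule obtained by the students.

```python
def tree_to_rules(node, conditions=()):
    """Collect one (conditions, label) pair for each root-to-leaf path."""
    if "label" in node:  # leaf: the path so far becomes a rule
        return [(list(conditions), node["label"])]
    rules = []
    for value, child in node["branches"].items():
        test = (node["attribute"], value)
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules

def format_rule(rule):
    """Render a rule as an IF ... THEN ... string."""
    conditions, label = rule
    tests = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    return f"IF {tests} THEN {label}"

# Hypothetical toy tree, for illustration only.
tree = {
    "attribute": "chest_pain",
    "branches": {
        "asymptomatic": {
            "attribute": "exercise_angina",
            "branches": {
                "yes": {"label": "disease"},
                "no": {"label": "healthy"},
            },
        },
        "other": {"label": "healthy"},
    },
}

rules = tree_to_rules(tree)
for rule in rules:
    print(format_rule(rule))
```

Each rule can then be read and judged on its own, which is what makes this form convenient for review by domain experts.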
The Assessment
This project has paid sufficient attention to every task at every stage of the data
mining process. The project adheres to the CRISP-DM guidelines and the tasks are
performed in a systematic manner. The business objective of discovery is outlined
and related to the data mining goals. Data characteristics are studied carefully,
but the data summary covers only the numeric data. The methods for cleaning the data are