Data Mining Project: A Critical Element in Teaching, Learning and Assessment of a Data Mining Module - Advances in Databases


The Student Work

The project was undertaken by a group of three students in 2007. Details of the work are listed as follows:

• Data understanding. From the start, the group outlined a clear business objective, i.e. finding patterns relating to the presence or absence of heart disease. The group collected and formatted the data, explored domain types and values, obtained descriptive statistics for numerical attributes, and assessed data quality. Separate reports for these purposes were also produced.

• Data preparation and pre-processing. The group focused on data cleaning by removing outliers and filling missing values with sensible alternatives. For both purposes, the ordinal and nominal values were first converted into discrete integers. To deal with missing values, the group decided to find the record's nearest neighbour and use the attribute value of the neighbour to fill the missing field. To deal with outliers, the students first plotted the data records as points in a scatter plot and manually located the anomalous values. An anomalous value was considered to have been wrongly entered and was hence also replaced by the value of its nearest neighbour. After the data cleaning operations, the discrete integers for ordinal and nominal attributes were converted back to the original labels.

• Data modelling/mining. The group conducted two main data mining tasks: to build a model to classify whether a patient is healthy or has the disease, and to profile patients in both classes via clustering. For the first task, the group used the J4.8 decision tree method with different parameter settings and a 2/3-1/3 split of training-testing examples as the test option. A number of possible trees with overall accuracy rates from 72% to 79% were obtained. The students realised that pruning improves the tree accuracy. Figure 3 shows the performance summary of one of the candidate trees. To consolidate the finding, a similar classification task was also attempted using the tree induction method of another tool (RDS). Some similarities in the resulting trees were found. For the second task, the k-means method was used with different values of k tried for good cluster quality, and eventually the optimal value for k was set to 4.

• Post-processing. Both the decision trees and the clusters were evaluated as different parameter settings for the tree induction and different values for k were tried. The group also converted the trees into rules to assist their understanding. After consulting external medical experts, the group found that some of the rules made good medical sense and supported recommendations for certain people to avoid heart disease. The interpretation of clustering results was attempted via cluster summary and through visualising membership of the clusters.
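The nearest-neighbour filling of missing values described under data preparation can be illustrated with a minimal sketch. The text does not say which distance measure or tool the students used, so the Euclidean distance over the attributes that both records have, the function names, and the three sample records below are all assumptions made for illustration only.

```python
import math


def distance(a, b):
    """Euclidean distance over the attributes present in both records.

    Assumption: the source does not name the distance measure used.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


def fill_missing(records):
    """Return a copy of records in which each missing field (None) takes the
    value held by the nearest other record that has that field."""
    filled = [list(r) for r in records]
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            if value is not None:
                continue
            # Candidate donors: every other record with a value in column j.
            donors = [r for k, r in enumerate(records)
                      if k != i and r[j] is not None]
            if donors:
                nearest = min(donors, key=lambda r: distance(rec, r))
                filled[i][j] = nearest[j]
    return filled


# Made-up records, with nominal values already coded as integers.
records = [
    [63, 1, 233],
    [41, 0, 204],
    [62, 1, None],  # third attribute is missing
]
filled = fill_missing(records)
```

In this sketch the third record is closest to the first, so it inherits that record's third value. Because the attributes are not rescaled, those with large numeric ranges dominate the distance; whether the students normalised the data first is not stated.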

The Assessment

This project has paid sufficient attention to every task at every stage of the data mining process. The project adheres to the CRISP-DM guideline and the tasks are performed in a systematic manner. The business objective of discovery is outlined and related to the data mining goals. Data characteristics are studied carefully, but data summary is done only for numeric data. The methods for cleaning the data are
