Data Mining Project: A Critical Element in Teaching, Learning and Assessment of a Data Mining Module - Advances in Databases

The Student Work

The project was also undertaken in 2010, but by a single student, against the tutor's advice. Key points of the project work are summarised as follows:

• Data understanding. The student decided to carry out a brute-force, bottom-up discovery of any potential patterns. At this stage, the student used Weka and Excel to gain an understanding of domain types, value distributions of attributes, and unknown values. Unlike the group for project one, this student did not identify any anomalies, but spotted that values for the contracting sum attribute were extremely skewed towards the lower end.

• Data preparation and pre-processing. As in project one, regional codes were replaced with nominal labels (A, B, C, D and E, where E stands for unknown). The age attribute was discretized into more natural age groups such as child, teenager, young, adult and senior. Because of the skew of the contracting sum values, the student decided to apply a logarithmic transformation to the original values so that the orders of magnitude, rather than the actual figures, of the contracting sums were considered.

• Data modelling/mining. The student decided to attempt every data mining task: classification, clustering and association mining. The student laboriously tried 4 methods for clustering, 10 methods for classification, and 2 methods for association rule discovery. For clustering, different values of k were attempted for the k-means and EM methods. For classification, the student attempted to induce classification models for product types, and used the Weka Experimenter to compare performance among the classification methods. Little explanation was given regarding the setting of parameters. For association rule discovery, confidence and accuracy were used for selecting the top 10 rules.

• Post-processing. The student was conscious of the value for k and used the evaluation of cluster quality to determine the optimal value. However, apart from the performance analysis using the Experimenter, very little attention was paid to the detailed performance evaluation of the different classification models shown in confusion matrices. The student did not notice the strength of the JRip method in classifying life insurance buyers. The student did not pay attention to the meaning of the association rules at all.
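The point about confusion matrices in the post-processing step can be illustrated with a minimal sketch. It assumes a two-class problem with hypothetical labels ("life" and "other") and made-up counts that are not taken from the project. It shows how two models with the same overall accuracy can differ sharply in how well they recognise one particular class, which is the kind of strength that an accuracy-only comparison would miss.

```python
def accuracy(matrix):
    """Overall accuracy: correctly classified instances over all instances."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_metrics(matrix, labels):
    """Precision and recall per class.

    matrix[i][j] counts instances of actual class i predicted as class j
    (rows are actual classes, columns are predicted classes).
    """
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        actual = sum(matrix[i])                    # row total
        predicted = sum(row[i] for row in matrix)  # column total
        metrics[label] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return metrics


# Hypothetical labels and counts, for illustration only.
labels = ["life", "other"]
model_a = [[10, 40], [5, 145]]
model_b = [[40, 10], [35, 115]]

metrics_a = per_class_metrics(model_a, labels)
metrics_b = per_class_metrics(model_b, labels)
```

In this made-up case both matrices give the same overall accuracy, yet the recall for the "life" class differs widely between the two models, so a ranking by accuracy alone would treat them as equivalent.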

The Assessment

The project is a showcase of trial-and-error taken to the extreme. Many trials were made and many patterns were discovered, but these patterns were not carefully examined. There are gems of good ideas here and there in some individual tasks, such as the logarithmic transformation of the contracting sum attribute and the comparative study of techniques for classification, but the project as a whole is not a piece of coherent work. The student did not realise the complexity of most of the decision trees, and totally ignored the inappropriate associations (e.g. contracting sums with their logarithm-transformed values). The project can only be classified as fair, with a total percentage mark of 48% (10 for Data Understanding, 15 for Data Preparation and Pre-processing, 10 for Data Modelling/Mining, 8 for Post-processing and 5 for project management).
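Why an association between contracting sums and their logarithm-transformed values is inappropriate can be seen in a small sketch. It assumes made-up sums skewed towards the lower end and equal-frequency binning into three bins. Neither the data nor the binning method comes from the project, and the helper names are hypothetical. Because the logarithm preserves order, both attributes fall into identical bins, so a rule linking them holds trivially and carries no information.

```python
import math


def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def confidence(antecedent, consequent):
    """Confidence of the rule antecedent => consequent over boolean lists."""
    covered = [c for a, c in zip(antecedent, consequent) if a]
    return sum(covered) / len(covered) if covered else 0.0


# Made-up contracting sums, skewed towards the lower end (illustration only).
sums = [100, 120, 150, 200, 300, 500, 1_000, 5_000, 20_000, 100_000]
log_sums = [math.log10(s) for s in sums]

sum_bins = equal_frequency_bins(sums, 3)
log_bins = equal_frequency_bins(log_sums, 3)

# Rule: "sum is in the lowest bin" => "log(sum) is in the lowest bin".
low_sum = [b == 0 for b in sum_bins]
low_log = [b == 0 for b in log_bins]
rule_confidence = confidence(low_sum, low_log)
```

Under these assumptions the rule reaches the maximum possible confidence, which is why it would rank among the top rules while saying nothing about the customers. Keeping only one of the two attributes in the input to association mining avoids this.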
