The Student Work
This project was also undertaken in 2010, but by a single student working alone
against the tutor's advice. Key points of the project work are summarised as follows:
Data understanding. The student decided to carry out a brute-force, bottom-up
discovery of any potential patterns. At this stage, the student used Weka and
Excel to gain an understanding of domain types, value distributions of attributes,
and unknown values. Unlike the group for project one, this student did not
identify any anomalies, but spotted that values for the contracting sum attribute
were extremely skewed towards the lower end.
Data preparation and pre-processing. As in project one, regional codes were
replaced with nominal labels (A, B, C, D and E, where E stands for unknown). The
age attribute was discretized into more natural age groups: child, teenager,
young, adult and senior. Because of the skew of the contracting sum values, the
student decided to apply a logarithmic transformation to the original values so
that the orders of magnitude, instead of the actual figures of contracting sums,
were considered.
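The two preparation steps above can be sketched in a few lines of Python. This is only an illustration: the text names the age groups but not their cut-off ages, so the boundaries in AGE_BOUNDS are hypothetical, and the use of log10(1 + x) to keep zero sums defined is likewise an assumption rather than the student's documented choice.

```python
import math
from bisect import bisect_right

# Hypothetical cut-off ages; the source names the groups but not the boundaries.
AGE_BOUNDS = [13, 20, 30, 60]
AGE_LABELS = ["child", "teenager", "young", "adult", "senior"]


def age_group(age):
    """Map a numeric age to one of the five nominal age groups."""
    return AGE_LABELS[bisect_right(AGE_BOUNDS, age)]


def log_magnitude(contracting_sum):
    """Return log10(1 + x); adding 1 is an assumption that keeps zero sums defined."""
    return math.log10(1 + contracting_sum)


# Made-up contracting sums, heavily skewed towards the lower end.
sums = [0, 9, 99, 999, 99999]
magnitudes = [log_magnitude(s) for s in sums]
groups = [age_group(a) for a in (5, 13, 25, 45, 70)]
```

After the transformation, sums that differ by a factor of ten differ by roughly one unit, which is what lets the mining methods compare orders of magnitude rather than raw figures.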
Data modelling/mining. The student decided to attempt every data mining
task: classification, clustering and association mining. The student laboriously
tried 4 methods for clustering, 10 methods for classification, and 2 methods for
association rule discovery. For clustering, different values of k were attempted
for the k-means and the EM methods. For classification, the student attempted to
induce classification models for product types, and used the Weka Experimenter
to compare performances among the classification methods. Little explanation
was given regarding the setting of parameters. For association rule discovery,
confidence and accuracy were used to select the top 10 rules.
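As a rough illustration of ranking rules by confidence, the sketch below computes support and confidence by hand on a handful of made-up records. The records, the attribute=value encoding and the helper names are all hypothetical; the student worked in Weka, whose internal ranking is not reproduced here.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as a ratio of supports."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so treat its confidence as zero
    return support(transactions, set(antecedent) | set(consequent)) / base


# Made-up records, each encoded as a set of attribute=value items.
transactions = [
    {"region=A", "age=adult", "product=life"},
    {"region=A", "age=adult", "product=life"},
    {"region=A", "age=young", "product=car"},
    {"region=B", "age=adult", "product=life"},
    {"region=B", "age=senior", "product=home"},
]

# Candidate rules as (antecedent, consequent) pairs.
rules = [
    ({"region=A"}, {"product=life"}),
    ({"age=adult"}, {"product=life"}),
]

# Sort by confidence, highest first, and keep at most ten rules.
ranked = sorted(rules, key=lambda r: confidence(transactions, *r), reverse=True)
top_rules = ranked[:10]
```

A high confidence value says nothing about whether a rule is meaningful, so a ranked list like this still has to be read rule by rule.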
Post-processing. The student was aware of the importance of the value of k and
used the evaluation of cluster quality to determine the optimal value. However,
except for the performance analysis using Experimenter, very little attention was
paid to the detailed performance evaluation of different classification models
shown in confusion matrices. The student did not notice the strength of the JRip
method in classifying life insurance buyers. The student did not pay attention to
the meaning of association rules at all.
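The kind of detailed evaluation that was missing can be sketched as follows: per-class precision and recall read off a confusion matrix, which can reveal that a model is strong on one class even when its overall accuracy is unremarkable. The class labels and counts below are invented for illustration and are not results from the project.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] counts instances of actual class i predicted as class j."""
    metrics = {}
    for k, label in enumerate(labels):
        true_positives = matrix[k][k]
        actual = sum(matrix[k])                  # row total: instances of this class
        predicted = sum(row[k] for row in matrix)  # column total: predictions of it
        metrics[label] = {
            "precision": true_positives / predicted if predicted else 0.0,
            "recall": true_positives / actual if actual else 0.0,
        }
    return metrics


# Invented confusion matrix for three hypothetical product types.
labels = ["life", "car", "home"]
matrix = [
    [40, 5, 5],
    [10, 30, 10],
    [10, 10, 30],
]

metrics = per_class_metrics(matrix, labels)
total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[k][k] for k in range(len(labels))) / total
```

In this made-up example the overall accuracy is about two thirds, yet recall for the "life" class is 0.8, a difference that a single summary figure would hide.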
The Assessment
The project is a showcase of trial-and-error taken to the extreme. Many trials were
made and many patterns were discovered, but these patterns were not carefully
examined. There are gems of good ideas here and there in some individual tasks, such
as the logarithmic transformation of the contracting sum attribute and the
comparative study of techniques for classification, but the project as a whole is not
a piece of coherent work. The student did not realise the complexity of most decision
trees, and totally ignored the inappropriate associations (e.g. contracting sums with
their logarithm-transformed values). The project can only be classified as fair, with
a total percentage mark of 48% (10 for Data Understanding, 15 for Data Preparation
and Pre-processing, 10 for Data Modelling/Mining, 8 for Post-processing and 5 for
project management).