groups, or different data sets for different project groups. Given the time constraint
and the project complexity, only one data set should be used by a project group. It
should not be too large in size or too high in dimensionality. A data set with hundreds
or even thousands of records and tens of variables would be ideal.
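The size guideline above can be turned into a quick screening check before a group commits to a data set. The following is a minimal sketch only: the numeric thresholds are our own reading of "hundreds or even thousands of records and tens of variables" and are not prescribed by the project, and the function names are hypothetical.

```python
import csv
import io

# Illustrative thresholds (assumptions, not taken from the text):
# "hundreds or even thousands of records" is read as 100 to 9,999,
# "tens of variables" as 10 to 99.
MIN_RECORDS, MAX_RECORDS = 100, 9_999
MIN_VARIABLES, MAX_VARIABLES = 10, 99


def dataset_shape(csv_text):
    """Return (records, variables) for CSV text whose first row is a header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return 0, 0
    header, records = rows[0], rows[1:]
    return len(records), len(header)


def is_suitable(records, variables):
    """Check a data set's shape against the illustrative thresholds above."""
    return (MIN_RECORDS <= records <= MAX_RECORDS
            and MIN_VARIABLES <= variables <= MAX_VARIABLES)
```

A tutor or group could call `is_suitable(*dataset_shape(text))` on a candidate file's contents and adjust the thresholds to suit the time available.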
One potential concern regarding data sets is permission for use. Most data sets
from the public domain or downloaded from the internet come with such permission.
Nonetheless, both students and the tutor must check if the permission has been
granted before using a data set.
Data mining software is another issue to be addressed. Existing software can be
categorised into commercial systems and free tools. The commercial systems, e.g.
Oracle Data Miner, are built to cope with the workload of real-life data sets of large
sizes and high dimensionality. However, these systems are often cumbersome to learn
and use, and offer a limited choice of solutions. In contrast, the free tools are often
lightweight and easy to learn and use. Although many free tools fail to cope with data sets of
extremely large size and high dimensionality, they should be sufficient for the kind
of data sets for the project. Weka [12] is a free downloadable tool that has been
widely used. Its Explorer module has a simple graphical user interface through which
small-scale data mining and data exploration can be performed. The Knowledge Flow
module can be used for more substantial data mining through carefully designed
task flows. The Experimenter module enables comparison of the performance of
classification methods. Its free license removes the availability constraint.
A single data mining tool may not always meet all requirements of the project. For
example, some data pre-processing may be better done using another tool such as
Microsoft Excel before the data set is loaded into Weka. All practical knowledge of
the mining tool is acquired through purpose-designed practical classes. Practical
knowledge of other software tools can be obtained either via additional practical sessions
for the module or through transferable skills from earlier modules.
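As an illustration of pre-processing outside the mining tool, the sketch below converts a small CSV export (for example, one saved from a spreadsheet) into Weka's ARFF text format. It is a minimal example under stated assumptions rather than a recommended workflow: the function name `csv_to_arff` and the default relation name are hypothetical, attribute names and values are assumed to contain no spaces, commas or quotes, and missing values are assumed to be already written as `?`.

```python
import csv
import io


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="project_data"):
    """Convert CSV text with a header row into ARFF text.

    Assumptions: simple names and values (no spaces, commas or quotes),
    and missing values already encoded as "?".
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        # Ignore missing values when inferring the attribute type.
        values = [r[i] for r in records if r[i] != "?"]
        if not values:
            attr_type = "string"  # nothing to infer from
        elif all(is_number(v) for v in values):
            attr_type = "numeric"
        else:
            # Nominal attribute: list the distinct values in sorted order.
            attr_type = "{" + ",".join(sorted(set(values))) + "}"
        lines.append("@attribute " + name + " " + attr_type)
    lines += ["", "@data"]
    lines += [",".join(r) for r in records]
    return "\n".join(lines) + "\n"
```

The returned text can be written to a `.arff` file and opened in the Explorer module. Real data sets usually need more care, such as quoting values that contain spaces, which this sketch deliberately omits.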
2.4 Administration of the Project
The administration of the project follows the data mining lifecycle described in Section 2.2.
At the beginning, the tutor provides students with a specification document and, where
appropriate, a presentation. This is followed by a period for data selection and group formation by
students. Each group has a leader. The role may be taken by a specific member or
taken in rotation by all members of the group. Each group should then arrange a kick-off
meeting with the tutor to gain a better understanding of the background of the
chosen data set and present a project plan. During the project period, each group
should hold regular meetings with the tutor to report the project progress and discuss
issues arising. The tutor should by no means intervene in project decisions and activities.
The tutor should play the roles of a monitor, a critic and a fictional client. The
tutor monitors student progress through a sequence of small deliverables such as verbal
reports and demonstrations. By the end of the project, the reports from all groups
may be compiled into a single set of proceedings and shared among all students of the class
before the oral presentation is held.
Ideally, the module's progression through the key topics of the subject should coincide with
the project lifecycle. At the beginning when the project specification is given out, the
module introduces data mining concepts, principles and methodologies. The module