should normally involve one of three main types of data mining, i.e. classification,
cluster analysis and association discovery. In the cases when more than one type is
required for the intended business purpose, the number of mining tasks for each type
should be limited. Doing one thing properly is always more desirable than attempting
many superficially. Compared with real-life data mining, this project is indeed a
small-scale mini-project in every way.
The project suits a full module in the final year (FHEQ level 6) worth 15 to 20
units of credit. A specification for the module and its prerequisites is presented in
[3]. Some elementary knowledge of probability and statistics is assumed.
2.2 Project Content and Deliverables
The project should concentrate on the following main stages of the data mining process:
1. Data understanding. This stage involves activities in studying the data and data
backgrounds, understanding related business activities from which the data are
collected, conducting exploratory summaries and outlining possible directions for
discovery.
2. Data preparation. The project work at this stage includes preparing and
formatting data, pre-processing the data (such as discretisation, transformation,
attribute selection and sampling) and, if possible, improving data quality.
3. Data modelling/mining. This stage is concerned with selecting suitable data mining
solutions, setting appropriate parameters for the solutions, observing results,
and deciding whether alternative mining solutions are needed and whether any further
data preparation is required before another round of mining begins.
4. Post-processing. This stage of the project involves collecting results, evaluating
the patterns for their significance and quality, attempting to interpret the patterns,
and evaluating their fitness for the purposes outlined in stage 1.
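As an illustration only, the four stages above can be sketched in a few lines of
Python. Everything in this sketch is an assumption made for the example and is not
taken from the project specification: the toy records, the attribute names "age" and
"buys", the function names, the choice of equal-width discretisation, and the very
simple one-attribute majority-class rule used as the mining step.

```python
import random
import statistics
from collections import Counter

# Hypothetical toy records, invented purely for illustration.
rows = [
    {"age": 22, "buys": "no"}, {"age": 25, "buys": "no"},
    {"age": 41, "buys": "yes"}, {"age": 45, "buys": "yes"},
    {"age": 63, "buys": "no"}, {"age": 67, "buys": "no"},
]

def understand(rows, attr):
    """Stage 1: exploratory summary of one numeric attribute."""
    values = [r[attr] for r in rows]
    return {"n": len(values), "min": min(values), "max": max(values),
            "mean": statistics.mean(values)}

def discretise(value, lo, hi, bins):
    """Equal-width discretisation: map a value to a bin index 0..bins-1."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)  # the maximum value falls into the last bin

def prepare(rows, attr, bins=3, sample_size=None, seed=0):
    """Stage 2: optional random sampling, then discretisation of one attribute."""
    if sample_size is not None:
        rows = random.Random(seed).sample(rows, sample_size)
    lo = min(r[attr] for r in rows)
    hi = max(r[attr] for r in rows)
    return [dict(r, **{attr + "_bin": discretise(r[attr], lo, hi, bins)})
            for r in rows]

def mine(rows, attr_bin, label):
    """Stage 3: for each bin, predict the majority class seen in that bin."""
    counts = {}
    for r in rows:
        counts.setdefault(r[attr_bin], Counter())[r[label]] += 1
    return {b: c.most_common(1)[0][0] for b, c in counts.items()}

def evaluate(rule, rows, attr_bin, label):
    """Stage 4: fraction of records whose class the rule predicts correctly."""
    hits = sum(rule.get(r[attr_bin]) == r[label] for r in rows)
    return hits / len(rows)

summary = understand(rows, "age")
prepared = prepare(rows, "age", bins=3)
rule = mine(prepared, "age_bin", "buys")
# Evaluated on the training records only to keep the sketch short;
# a real project would evaluate on held-out data.
accuracy = evaluate(rule, prepared, "age_bin", "buys")
print(summary, rule, accuracy)
```

Changing the bins parameter and re-running the last four lines mimics the loop
described in stage 3, where the results of one round of mining may prompt further
data preparation before the next round.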
Two main phases of the CRISP-DM standard, i.e. business understanding and deployment,
have not been mentioned. This is because the true and complete business context of a
selected data set may not be available, and hence it is difficult to mock up the
business reality. However, the tutor and students should seek as much information as
possible about the data background from the limited sources available. Students should
make an effort to consider possible deployments of useful patterns given their limited
understanding of the application.
The deliverables for the project include a written report and an oral presentation
from each project group. The report documents the details of the project work and the
rationale behind it at each stage of the data mining process. The oral presentation
aims to outline the main issues with the data set, highlight major project tasks and
key findings, justify any decisions taken, and defend the project work.
2.3 Related Issues
A number of issues regarding the project must be addressed. First, a suitable data set
should be located. Such data sets were hard to find in the early years. Since 2005,
increasing numbers of data sets from the public domain and commercial sources have
become available online [8]. The project may require a single data set for all project