particular file or a set of files. They used the software maintenance records to provide
the training data set in a DM process.
2.3 Data Mining Tools in SDLC
The most commonly used DM techniques in SDLC are decision trees, neural
networks and association analysis. A DM tool used in SDLC should apply these
techniques effectively, be easy to use, support data preparation and, most importantly,
be able to present results succinctly. General-purpose DM tools such as
SAS 'Enterprise Miner' and StatSoft's 'Statistica Data Miner' (known for the user-friendly
drag-and-drop workspace and good reporting functions) can be used. There
also exist DM tools built specifically to assist in the SDLC process. An example is
EMERALD [4] (Enhanced Measurement for Early Risk Assessment of Latent
Defects), which assesses reliability risk for software developers and managers. This tool
has been used in a number of studies; e.g., [19] used EMERALD to predict fault
ranges of software modules with Fuzzy Nonlinear Regression.
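As a rough illustration of association analysis, one of the techniques named in this section, the sketch below derives simple "file A changes with file B" rules from maintenance records. The file names, the change sets and the thresholds are invented for the example, and the function `pair_rules` is a hypothetical helper, not part of any tool mentioned here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical maintenance records: each entry is the set of files
# touched by one change. All file names are invented for illustration.
changes = [
    {"parser.c", "lexer.c"},
    {"parser.c", "lexer.c", "ast.c"},
    {"parser.c", "ast.c"},
    {"lexer.c", "parser.c"},
    {"ui.c"},
]


def pair_rules(transactions, min_support=0.3, min_confidence=0.7):
    """Return sorted (antecedent, consequent, support, confidence) rules
    for pairs of files that are frequently changed together."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        item_counts.update(t)
        # Sort so that each unordered pair is always counted under one key.
        pair_counts.update(combinations(sorted(t), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of changes containing both files
        if support < min_support:
            continue
        # Test the rule in both directions: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # estimate of P(y | x)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


rules = pair_rules(changes)
for x, y, s, c in rules:
    print(f"{x} -> {y}  support={s:.2f} confidence={c:.2f}")
```

A rule such as `lexer.c -> parser.c` only says that the two files tended to change together in these records; whether that reflects a real dependency still has to be judged by someone who knows the project.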
2.4 Major Issues Arising with Applications of Data Mining in SDLC
General problems encountered with data, such as over-fitting/poor-fitting, missing and
noisy values, and large size and dimensionality, remain the same in this domain as in
others. Some of the issues listed below can be considered major requirements and
challenges for the further evolution of DM technology in SDLC.
Diversity of the Data Types: Large software projects often keep huge amounts of data
spread over different non-consolidated repositories such as source code repositories (e.g.,
CVS*), conceptual models of software (e.g., UML), modelling tools (e.g., Rational Rose),
project management tools and documentation tools. Additionally, data collected during
an SDLC process reside in many sources such as flat files, relational databases, data
warehouses, transactional databases, advanced database systems (including object-oriented,
object-relational, multimedia and specific application-oriented databases) and
the Web. While DM is applicable to any kind of data, the challenges and techniques
may vary depending on the repository type.
A DM system must be able to deal with data drawn from different sources and
formats. Without proper pre-processing, analysis of data to uncover patterns will be
difficult, since poor-quality data ultimately leads to useless discoveries. The pre-processing
module should use simple query languages to extract data
from the various repositories, then integrate the data, select the relevant parts,
assess their quality and convert them to the format suitable for the analysis tool.
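The pre-processing steps described above (extract, integrate, select, assess quality, convert) can be sketched as follows. This is a minimal sketch under stated assumptions: the two sources (an in-memory SQLite defect table and a CSV export of code metrics), their column names and the function `preprocess` are all hypothetical, chosen only to show the shape of such a module.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a relational defect tracker (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE defects (module TEXT, severity INTEGER)")
conn.executemany(
    "INSERT INTO defects VALUES (?, ?)",
    [("parser", 3), ("parser", 1), ("lexer", 2), ("ui", None)],
)

# Hypothetical source 2: a flat file of code metrics from another tool.
# The "ui" record has a missing size value.
metrics_csv = io.StringIO(
    "module,loc\n"
    "parser,1200\n"
    "lexer,800\n"
    "ui,\n"
    "net,450\n"
)


def preprocess(conn, metrics_file):
    """Extract, integrate, select and quality-check records from both sources."""
    # Extract: one simple query against the relational repository,
    # ignoring defect records whose severity is missing.
    defect_counts = dict(conn.execute(
        "SELECT module, COUNT(*) FROM defects "
        "WHERE severity IS NOT NULL GROUP BY module"
    ))
    rows, rejected = [], []
    for rec in csv.DictReader(metrics_file):
        # Assess quality: reject records with a missing or non-numeric size.
        if not rec["loc"].strip().isdigit():
            rejected.append(rec["module"])
            continue
        # Integrate and select: join on module name, keep only needed fields.
        rows.append({
            "module": rec["module"],
            "loc": int(rec["loc"]),
            "defects": defect_counts.get(rec["module"], 0),
        })
    return rows, rejected


rows, rejected = preprocess(conn, metrics_csv)

# Convert: write the cleaned table in the format the analysis tool expects
# (plain CSV is assumed here).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["module", "loc", "defects"],
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
print("rejected:", rejected)
```

Keeping the rejected records in a separate list, rather than silently dropping them, makes it possible to report how much data was lost to quality checks before any mining is done.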
Mining Methodology and User Interaction Issues: Data residing in many sources
also poses a problem for mining knowledge at multiple levels [8]. This raises the
problem of finding associations among the various sets of extracted knowledge. The
integration of extracted relationships from various sources is an unresolved issue [18].
* CVS (Concurrent Versions System) is a common way to store the source code of a project in a centralized repository.
UML (Unified Modelling Language) is a popular modelling language in SDLC.
Rational Rose is an industry-standard multi-purpose modelling tool (by IBM) for software projects.