Use of Data Mining in System Development Life Cycle - Data Mining: Theory, Methodology, Techniques, and Applications

particular file or a set of files. They used the software maintenance records to provide

the training data set in a DM process.

2.3 Data Mining Tools in SDLC

The most commonly used DM techniques in SDLC are decision trees, neural

networks and association analysis. A DM tool used in SDLC should apply these

techniques effectively, be easy to use, support data preparation and, most importantly,

be able to present results in a succinct manner. General-purpose DM tools such as

SAS 'Enterprise Miner' and StatSoft's 'Statistica Data Miner' (known for its user-friendly

drag-and-drop workspace and good reporting functions) can be used. There

also exist DM tools built specifically to assist the SDLC process. An example is

EMERALD [4], Enhanced Measurement for Early Risk Assessment of Latent

Defects, for assessing reliability risk for software developers and managers. This tool

has been used in a number of studies; e.g., [19] used EMERALD for predicting fault

ranges of software modules with Fuzzy Nonlinear Regression.
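
To make the association analysis technique named above concrete, the following is a minimal sketch, not the method of any tool mentioned in the text. It assumes hypothetical maintenance records in which each record lists the files changed together in one fix; the file names, the `pair_rules` helper and the 0.5 support threshold are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical maintenance records: each set lists files changed together in one fix.
CHANGES = [
    {"db.c", "query.c"},
    {"db.c", "query.c", "ui.c"},
    {"db.c", "query.c"},
    {"ui.c"},
]


def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent file pairs."""
    n = len(transactions)
    # How often each file appears in a record.
    item_counts = Counter(f for t in transactions for f in t)
    # How often each unordered pair of files appears in the same record.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be of interest
        # Confidence of "a changed => b changed" and of the reverse rule.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules)


RULES = pair_rules(CHANGES)
for antecedent, consequent, support, confidence in RULES:
    print(f"{antecedent} => {consequent}: support={support}, confidence={confidence}")
```

Only pairs are considered here to keep the sketch short; a full association analysis would also enumerate larger item sets, for example with the Apriori algorithm.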

2.4 Major Issues Arising with Applications of Data Mining in SDLC

General data problems such as over-fitting or poor fitting, missing and noisy values,

and large size and dimensionality apply to this domain as they do to any

other. The issues listed below can be considered major requirements and

challenges for the further evolution of DM technology in SDLC.

Diversity of the Data Types: Large software projects often keep huge amounts of data

spread over different non-consolidated repositories, such as source code repositories (e.g.,

CVS*), conceptual models of software (e.g., UML†), modelling tools (e.g., Rational Rose‡),

project management tools and documentation tools. Additionally, data collected during

an SDLC process reside in many sources such as flat files, relational databases, data

warehouses, transactional databases, advanced database systems (including object-oriented,

object-relational, multimedia and specific application-oriented databases) and

the Web. While DM is applicable to any kind of data, the challenges and techniques

may vary depending on the repository type.

A DM system must be able to deal with data drawn from different sources and

formats. Without proper pre-processing, analysing data to uncover patterns is

difficult, since poor-quality data ultimately leads to useless discoveries. The

pre-processing module should use simple query languages to extract data

from the various repositories, then integrate the data, select the relevant subset, assess

its quality and convert it to a format suitable for the analysis tool.
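
The pre-processing steps described above (extract, integrate, select, assess quality, convert) can be sketched as follows. This is an illustrative sketch under stated assumptions, not a prescribed design: the flat-file metrics export, the `defects` table, and every function and column name are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file export of code metrics; the "lexer" row lacks a commit count.
METRICS_CSV = """module,loc,commits
parser,1200,34
lexer,450,
ui,3100,87
"""


def load_metrics(text):
    """Extract metric rows from a flat file (CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def load_defects(conn):
    """Extract defect counts per module from a relational source with a simple query."""
    rows = conn.execute(
        "SELECT module, COUNT(*) FROM defects GROUP BY module"
    ).fetchall()
    return dict(rows)


def preprocess(metric_rows, defect_counts):
    """Integrate both sources, reject incomplete rows, convert fields to numbers."""
    clean, rejected = [], []
    for row in metric_rows:
        # Quality check: a row with a missing value is set aside, not guessed.
        if not row["loc"] or not row["commits"]:
            rejected.append(row["module"])
            continue
        clean.append({
            "module": row["module"],
            "loc": int(row["loc"]),
            "commits": int(row["commits"]),
            # Modules with no recorded defects get a count of zero.
            "defects": defect_counts.get(row["module"], 0),
        })
    return clean, rejected


def demo():
    """Build a small in-memory defect database and run the whole pipeline."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE defects (id INTEGER PRIMARY KEY, module TEXT)")
    conn.executemany(
        "INSERT INTO defects (module) VALUES (?)",
        [("parser",), ("parser",), ("ui",)],
    )
    try:
        return preprocess(load_metrics(METRICS_CSV), load_defects(conn))
    finally:
        conn.close()
```

Calling `demo()` returns the cleaned records together with the names of the rejected modules, so that data set aside for quality reasons stays visible to the analyst instead of disappearing silently.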

Mining Methodology and User Interaction Issues: Data residing in many sources

also poses a problem for mining knowledge at multiple levels [8]. This raises the

problem of finding associations among these various sets of extracted knowledge. The

integration of extracted relationships from various sources is an unresolved issue [18].

* CVS (Concurrent Versions System) is a commonly used system for storing the source code of a project in a centralized repository.

† Unified Modelling Language is a popular modelling language in SDLC.

‡ An industry standard multi-purpose modelling tool (by IBM) for software projects.
