The predictive modelling, or classification, task builds a model by recognising
distinct characteristics of the data set. We have chosen tree induction, or decision
trees (DT), for their simplicity, efficiency, and capability of dealing with noise and
large data. The size of a DT depends on the number of attributes used to construct it.
Because the number of attributes in our problem is small, the resulting DT is
relatively simple, and its structure is therefore easily understood by a human analyst.
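The paper induces its trees with C5; as a rough illustration of the same idea, the following sketch instead uses scikit-learn's CART implementation (an assumption, not the authors' setup) on a small stand-in data set, to show how a tree over few attributes stays readable:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data set with only four attributes; the PR data set in the paper
# is similarly narrow, which is what keeps the induced tree small.
data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# With few attributes, the printed tree is compact enough for a human analyst.
print(export_text(clf, feature_names=list(data.feature_names)))
```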
The link analysis operation exposes patterns and trends by predicting correlations
among the variables in a given data set. We have used the Apriori algorithm [6] to
reveal hidden affinities among the variables when a PR (problem report) is raised.
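The paper gives no code for this step; a minimal self-contained sketch of an Apriori-style level-wise search (the attribute names, values, and thresholds below are invented for illustration) is:

```python
from itertools import combinations

# Each PR is a set of attribute=value items (hypothetical names and values).
prs = [
    {"Severity=high", "Phase=testing", "Class=A"},
    {"Severity=high", "Phase=coding",  "Class=A"},
    {"Severity=low",  "Phase=testing", "Class=B"},
    {"Severity=high", "Phase=testing", "Class=A"},
]

def support(itemset):
    """Fraction of PRs containing every item in `itemset`."""
    return sum(itemset <= pr for pr in prs) / len(prs)

# Apriori-style level-wise search (shown only up to 2-itemsets): keep frequent
# 1-itemsets, join them into candidate 2-itemsets, keep those still frequent.
MIN_SUP, MIN_CONF = 0.5, 0.8
items = {i for pr in prs for i in pr}
f1 = {frozenset([i]) for i in items if support({i}) >= MIN_SUP}
f2 = {a | b for a, b in combinations(f1, 2) if support(a | b) >= MIN_SUP}

# Emit rules X => Y whose confidence = support(X ∪ Y) / support(X) is high enough.
for s in f2:
    for x in s:
        conf = support(s) / support({x})
        if conf >= MIN_CONF:
            print(f"{x} => {', '.join(s - {x})} "
                  f"(sup={support(s):.2f}, conf={conf:.2f})")
```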
We have used the C5 [2], CBA [1], and TextAnalyst [3] tools for classification, for
both classification and association rule mining, and for text mining, respectively.
3.3 Assimilation and Analysis of Outputs
Classification and Association Rule Mining: To obtain better rules and to decrease
the error rate, we use several approaches. One approach is to stratify the data on the
target attribute using choice-based sampling instead of random sampling: an equal
number of samples representing each possible value of the target attribute (Class) is
chosen for training. This improves the chance of finding rules associated with rarely
occurring target values during training, as sketched below. Another approach is to
vary the amount of PR data used as training sets.
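As a concrete sketch of that choice-based sampling step (the column name 'Class' comes from the paper; the file name and random seed are assumptions):

```python
import pandas as pd

prs = pd.read_csv("pr_reports.csv")  # hypothetical export of the PR database

# Choice-based sampling: take the same number of PRs for every target value,
# capped by the rarest class so each value of 'Class' is equally represented.
n_per_class = prs["Class"].value_counts().min()
balanced = prs.groupby("Class", group_keys=False).sample(n=n_per_class,
                                                         random_state=42)
print(balanced["Class"].value_counts())
```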
We used several training data sets. The first data set (Case 1, Table 1) contains
1224 PRs belonging to a specific software project, out of a total of 11,000 PRs. The
second data set (Case 2) is a medium-sized set of 3400 PRs with equally distributed
target values (about 900 PRs for each value of 'Class'), drawn from all software
projects. The third data set (Case 3) is a large set of 5381 PRs from all software
projects.
We also ran 10-fold cross-validation experiments on randomly selected PRs. The
cross-validation technique splits the whole data set into several subsets (called
folds); each fold in turn serves as the test set, with the remaining folds used for
training.
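A minimal sketch of that protocol, using scikit-learn's cross-validation utilities on synthetic stand-in data (the real experiments used C5/CBA on the encoded PR attributes):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded PR attributes and the 'Class' target.
X, y = make_classification(n_samples=1224, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Each of the 10 folds serves once as the test set; the rest train the model.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(f"average 10-fold error rate: {1.0 - scores.mean():.2%}")
```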
Experiments were conducted to test both types of time attribute: manually
discretised or continuous values (labelled D or C in Table 1). Table 1 reports the
classification mining results for all three cases, the association rule mining results as
Case 4, and the (average) 10-fold cross-validation results.
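For the D variant, a continuous time attribute can be manually discretised into ordered bins before mining; a small pandas sketch (the column meaning and bin edges are assumptions, not the paper's):

```python
import pandas as pd

# Hypothetical time-to-fix values in days for a handful of PRs.
time_to_fix = pd.Series([2, 11, 45, 160, 7, 30])

# Manual discretisation (the 'D' variant): map continuous values onto ordered
# labels; the 'C' variant would feed the raw numbers to the miner instead.
binned = pd.cut(time_to_fix,
                bins=[0, 7, 30, 90, float("inf")],
                labels=["week", "month", "quarter", "longer"])
print(binned.value_counts())
```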
We used two learning engines to discover rules from the PR data set: single-support
CBA (labelled SS in Table 1, e.g., Case1-SS) and multiple-support CBA (labelled MS
in Table 1, e.g., Case1-MS). Two constraints, support and confidence, are imposed on
rules to control the quality of the results. Confidence measures the strength of a rule:
the probability that the consequent(s) hold given that the rule's antecedent(s) are
present. Support indicates the number of input records that satisfy the rule.
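In the standard notation for a rule X ⇒ Y over a set of input records t, these two measures are conventionally defined as follows (the paper states support as an absolute count; dividing by the number of records gives the fractional form):

```latex
\mathrm{support}(X \Rightarrow Y) = \bigl|\{\, t : X \cup Y \subseteq t \,\}\bigr|,
\qquad
\mathrm{confidence}(X \Rightarrow Y) =
  \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
```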
Some of the attributes in the data do not have uniform distributions, and many
attribute values occur at very low frequency. A single minimum support for all
attributes therefore fails to discover important rules. This problem is relieved by
setting multiple minimum supports, which allow the user to assign different
minimum supports to different attributes.
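The exact multiple-support scheme used by CBA is described in [1]; a minimal sketch of the underlying idea, with hypothetical per-item minimum supports, is:

```python
# Hypothetical per-item minimum supports (MIS); rare items get lower thresholds
# so rules about them are not drowned out by one global minimum support.
mis = {"Severity=critical": 0.01, "Phase=testing": 0.05, "Class=A": 0.10}
DEFAULT_MIS = 0.05  # assumed fallback for items without an explicit MIS

def passes_min_support(itemset, sup):
    """In multiple-support mining, an itemset qualifies when its support
    meets the smallest MIS among its items."""
    return sup >= min(mis.get(item, DEFAULT_MIS) for item in itemset)

# A rare-but-important combination survives despite its low overall support.
print(passes_min_support({"Severity=critical", "Class=A"}, sup=0.02))  # True
```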
In general, all classification results in CBA achieve around a 46% error rate on the
training data set (the lowest is 43.51%, the highest is above 59.10%). Above 51%