The predictive modelling, or classification, task builds a model by recognising
distinct characteristics of the data set. We have chosen tree induction, or decision
trees (DT), for their simplicity, efficiency, and capability of dealing with noise and
large data. The size of a DT depends on the number of attributes used to construct it.
Because the number of attributes in our problem is small, the resulting DT is
relatively simple, and its structure is therefore easily understood by a human analyst.
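The paper induces its trees with C5; as a rough illustration of the same idea, the following sketch instead uses scikit-learn's CART implementation (an assumption, not the authors' setup) on a small stand-in data set, to show how a tree over few attributes stays readable:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data set with only four attributes; the PR data set in the paper
# is similarly narrow, which is what keeps the induced tree small.
data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# With few attributes, the printed tree is compact enough for a human analyst.
print(export_text(clf, feature_names=list(data.feature_names)))
```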
The link analysis operation exposes patterns and trends by predicting correlations
among the variables in a given data set. We have used the Apriori algorithm [6] to
reveal hidden affinities among the variables when a PR (problem report) is raised.
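The paper gives no code for this step; a minimal self-contained sketch of an Apriori-style level-wise search (the attribute names, values, and thresholds below are invented for illustration) is:

```python
from itertools import combinations

# Each PR is a set of attribute=value items (hypothetical names and values).
prs = [
    {"Severity=high", "Phase=testing", "Class=A"},
    {"Severity=high", "Phase=coding",  "Class=A"},
    {"Severity=low",  "Phase=testing", "Class=B"},
    {"Severity=high", "Phase=testing", "Class=A"},
]

def support(itemset):
    """Fraction of PRs containing every item in `itemset`."""
    return sum(itemset <= pr for pr in prs) / len(prs)

# Apriori-style level-wise search (shown only up to 2-itemsets): keep frequent
# 1-itemsets, join them into candidate 2-itemsets, keep those still frequent.
MIN_SUP, MIN_CONF = 0.5, 0.8
items = {i for pr in prs for i in pr}
f1 = {frozenset([i]) for i in items if support({i}) >= MIN_SUP}
f2 = {a | b for a, b in combinations(f1, 2) if support(a | b) >= MIN_SUP}

# Emit rules X => Y whose confidence = support(X ∪ Y) / support(X) is high enough.
for s in f2:
    for x in s:
        conf = support(s) / support({x})
        if conf >= MIN_CONF:
            print(f"{x} => {', '.join(s - {x})} "
                  f"(sup={support(s):.2f}, conf={conf:.2f})")
```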
We have used the C5 [2], CBA [1], and TextAnalyst [3] tools for classification, for
both classification and association rule mining, and for text mining, respectively.
3.3 Assimilation and Analysis of Outputs
Classification and Association Rule Mining: To obtain better rules and to decrease
the error rate, we use several approaches. One approach is to stratify the data on the
target attribute using choice-based sampling instead of random sampling: an equal
number of samples representing each possible value of the target attribute (Class) is
chosen for training. This improves the chance of finding rules associated with rarely
occurring target values during training, as sketched below. Another approach is to
vary the amount of PR data used as training sets.
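As a concrete sketch of that choice-based sampling step (the column name 'Class' comes from the paper; the file name and random seed are assumptions):

```python
import pandas as pd

prs = pd.read_csv("pr_reports.csv")  # hypothetical export of the PR database

# Choice-based sampling: take the same number of PRs for every target value,
# capped by the rarest class so each value of 'Class' is equally represented.
n_per_class = prs["Class"].value_counts().min()
balanced = prs.groupby("Class", group_keys=False).sample(n=n_per_class,
                                                         random_state=42)
print(balanced["Class"].value_counts())
```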
We used several training data sets. The first data set (Case 1, Table 1) contains
1224 PRs belonging to a specific software project, out of a total of 11,000 PRs. The
second data set (Case 2) is a medium-sized set of 3400 PRs with equally distributed
target values (about 900 PRs for each value of 'Class'), drawn from all software
projects. The third data set (Case 3) is a large set of 5381 PRs from all software
projects.
We also ran 10-fold cross-validation experiments on randomly selected PRs. The
cross-validation technique splits the whole data set into several subsets (called
folds); each fold in turn serves as the test set, with the remaining folds used for
training.
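A minimal sketch of that protocol, using scikit-learn's cross-validation utilities on synthetic stand-in data (the real experiments used C5/CBA on the encoded PR attributes):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded PR attributes and the 'Class' target.
X, y = make_classification(n_samples=1224, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Each of the 10 folds serves once as the test set; the rest train the model.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(f"average 10-fold error rate: {1.0 - scores.mean():.2%}")
```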
Experiments were conducted to test both types of time attribute: manually
discretised or continuous values (labelled D or C in Table 1). Table 1 reports the
classification mining results for all three cases, the association rule mining results as
Case 4, and the (average) 10-fold cross-validation results.
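For the D variant, a continuous time attribute can be manually discretised into ordered bins before mining; a small pandas sketch (the column meaning and bin edges are assumptions, not the paper's):

```python
import pandas as pd

# Hypothetical time-to-fix values in days for a handful of PRs.
time_to_fix = pd.Series([2, 11, 45, 160, 7, 30])

# Manual discretisation (the 'D' variant): map continuous values onto ordered
# labels; the 'C' variant would feed the raw numbers to the miner instead.
binned = pd.cut(time_to_fix,
                bins=[0, 7, 30, 90, float("inf")],
                labels=["week", "month", "quarter", "longer"])
print(binned.value_counts())
```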
We used two learning engines to discover rules from the PR data set: single-support
CBA (labelled SS in Table 1, e.g., Case1-SS) and multiple-support CBA (labelled MS
in Table 1, e.g., Case1-MS). Two constraints, support and confidence, are imposed on
rules to control the quality of the results. Confidence measures the strength of a rule:
the probability that the consequent(s) hold given that the rule's antecedent(s) are
present. Support indicates the number of input records that satisfy the rule.
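In the standard notation for a rule X ⇒ Y over a set of input records t, these two measures are conventionally defined as follows (the paper states support as an absolute count; dividing by the number of records gives the fractional form):

```latex
\mathrm{support}(X \Rightarrow Y) = \bigl|\{\, t : X \cup Y \subseteq t \,\}\bigr|,
\qquad
\mathrm{confidence}(X \Rightarrow Y) =
  \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
```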
Some of the attributes in the data do not have uniform distributions, and many
attribute values occur at very low frequency. A single minimum support for all
attributes therefore fails to discover important rules. This problem is relieved by
setting multiple minimum supports, which allow the user to assign different
minimum supports to different attributes.
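The exact multiple-support scheme used by CBA is described in [1]; a minimal sketch of the underlying idea, with hypothetical per-item minimum supports, is:

```python
# Hypothetical per-item minimum supports (MIS); rare items get lower thresholds
# so rules about them are not drowned out by one global minimum support.
mis = {"Severity=critical": 0.01, "Phase=testing": 0.05, "Class=A": 0.10}
DEFAULT_MIS = 0.05  # assumed fallback for items without an explicit MIS

def passes_min_support(itemset, sup):
    """In multiple-support mining, an itemset qualifies when its support
    meets the smallest MIS among its items."""
    return sup >= min(mis.get(item, DEFAULT_MIS) for item in itemset)

# A rare-but-important combination survives despite its low overall support.
print(passes_min_support({"Severity=critical", "Class=A"}, sup=0.02))  # True
```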
In general, all classification results in CBA achieve around a 46% error rate on the
training data set (the lowest is 43.51%, the highest is above 59.10%). Above 51%