DynaMine: Finding Usage Patterns and Their Violations by Mining Software Repositories - Mining Software Specifications: Methodologies and Applications

Databases Reference

In-Depth Information

values are high are favored. In the rest of the chapter we refer to the ranking

that follows classical data mining techniques as regular ranking.

7.3.3.2

Corrective Ranking

While the ranking schemes above can generally be applied to any data

mining problem, we have come up with a measure of a pattern's importance

that is specific to mining revision histories. Observation 7.2.3 is the basis of

the metric we are about to describe. A check-in may only add parts of a usage

pattern to the repository. Generally, this is a problem for the classic Apri-

ori algorithm, which prefers patterns, all parts of which are \seen together."

However, we can leverage these incomplete patterns when we realize that they

often represent bug fixes.

A recent study of the dynamic of small repository changes in large soft-

ware systems performed by Purushothaman et al. sheds a new light on this

subject [36]. Their paper points out that almost 50% of all repository changes

were small, involving less than 10 lines of code. Moreover, among one-line

changes, less than 4% were likely to cause a later error. Furthermore, only

less than 2.5% of all one-line changes were perfective changes that add func-

tionality, rather than corrective changes that correct previous errors. These

numbers imply a very strong correlation between one-line changes and bug

corrections or fixes.

We use this observation to develop a corrective ranking that extends the

ranking that is used in classical data mining. For this, we identify one-line fixes

and mark method calls that were added at least once in such a fix as fixed. In

addition to the measures used by regular ranking, we then additionally rank

by the number of fixed methods calls which is used as the first lexicographic

category. As discussed in Section 7.5, patterns with a high corrective rank

result in more dynamic violations than patterns with a high regular rank.

7.3.4 Locating Added Method Calls

In order to speed up the mining process, we pre-process the revision history

extracted from CVS and store this information in a general-purpose database;

our techniques are further described in Zimmermann et al. [52]. The database

stores method calls that have been inserted for each revision. To determine

the calls inserted between two revisions r 1 and r 2 , we build abstract syntax

trees (ASTs) for both r 1 and r 2 and compute the set of all calls C 1 and C 2 ,

respectively, by traversing the ASTs. C 2 nC 1 is the set of inserted calls between

r 1 and r 2 .

Unlike Williams and Hollingsworth [46, 47] our approach does not build

snapshots of a system. As they point out such interactions with the build

environment (compilers, makefiles) are extremely dicult to handle and result

in high computational costs. Instead we analyze only the differences between

single revisions. As a result our preprocessing is cheap and platform- and

Search WWH ::

Custom Search

Home