Databases Reference
In-Depth Information
values are high are favored. In the rest of the chapter we refer to the ranking
that follows classical data mining techniques as regular ranking.
7.3.3.2
Corrective Ranking
While the ranking schemes above can generally be applied to any data
mining problem, we have come up with a measure of a pattern's importance
that is specific to mining revision histories. Observation 7.2.3 is the basis of
the metric we are about to describe. A check-in may only add parts of a usage
pattern to the repository. Generally, this is a problem for the classic Apri-
ori algorithm, which prefers patterns, all parts of which are \seen together."
However, we can leverage these incomplete patterns when we realize that they
often represent bug fixes.
A recent study of the dynamic of small repository changes in large soft-
ware systems performed by Purushothaman et al. sheds a new light on this
subject [36]. Their paper points out that almost 50% of all repository changes
were small, involving less than 10 lines of code. Moreover, among one-line
changes, less than 4% were likely to cause a later error. Furthermore, only
less than 2.5% of all one-line changes were perfective changes that add func-
tionality, rather than corrective changes that correct previous errors. These
numbers imply a very strong correlation between one-line changes and bug
corrections or fixes.
We use this observation to develop a corrective ranking that extends the
ranking that is used in classical data mining. For this, we identify one-line fixes
and mark method calls that were added at least once in such a fix as fixed. In
addition to the measures used by regular ranking, we then additionally rank
by the number of fixed methods calls which is used as the first lexicographic
category. As discussed in Section 7.5, patterns with a high corrective rank
result in more dynamic violations than patterns with a high regular rank.
7.3.4 Locating Added Method Calls
In order to speed up the mining process, we pre-process the revision history
extracted from CVS and store this information in a general-purpose database;
our techniques are further described in Zimmermann et al. [52]. The database
stores method calls that have been inserted for each revision. To determine
the calls inserted between two revisions r 1 and r 2 , we build abstract syntax
trees (ASTs) for both r 1 and r 2 and compute the set of all calls C 1 and C 2 ,
respectively, by traversing the ASTs. C 2 nC 1 is the set of inserted calls between
r 1 and r 2 .
Unlike Williams and Hollingsworth [46, 47] our approach does not build
snapshots of a system. As they point out such interactions with the build
environment (compilers, makefiles) are extremely dicult to handle and result
in high computational costs. Instead we analyze only the differences between
single revisions. As a result our preprocessing is cheap and platform- and
 
Search WWH ::




Custom Search