Databases Reference
In-Depth Information
by applying formal concept analysis, and its input is just a matrix showing
which sequential constraints occur in which functions. Therefore, we can put
into the matrix the data for functions that stem from different projects. Our
idea is to have a set of reference projects to mine specifications from, and one
tested project to check for violations of those specifications.
Unfortunately, there is one fundamental scalability problem that we need
to solve to put the idea above into use: If we put data from all the reference
projects and the tested project into the matrix, it can become huge. There
are two reasons for this: First, there can be a lot of reference projects { and
therefore a lot of rows in the matrix. Second, the reference projects will differ
one from another in the functions that are called in them. This will result in
a huge variety of possible sequential constraints { meaning a lot of columns
in the matrix. Formal concept analysis will most likely choke on such a huge
matrix and this would make the whole idea impractical. How can we scale up
our approach?
Recall that our goal is finding violations of mined specifications in the
tested project, in the hope that these are defects. In other words, if a specifi-
cation cannot possibly be violated in the tested project, it might just as well
not have been mined at all. This leads to an important observation: If not
all specifications stemming from reference projects are useful, then also not
all sequential constraints stemming from the reference projects are useful. If
reference projects use a certain library that is not used in the tested project,
all sequential constraints related to this library can be safely ignored. This will
reduce the number of columns in the matrix, and might even reduce the num-
ber of rows, too, if it so happens that some reference projects are completely
unrelated to the tested project (i.e., use completely different APIs). It turns
out that this (coupled with increasing minimum support; see Section 12.5) is
enough to make the idea above scale and work very well in practice.
How do we decide which sequential constraints to keep and which to re-
move? We limit ourselves to sequential constraints that pertain to the tested
project.
Definition 12.8.1. A sequential constraint e 1 e 2 pertains to the tested
project iff there exists in the tested project a function, whose sequential con-
straints abstraction contains e 1 e or e e 2 , where e is an arbitrary event.
This definition can be straightforwardly translated into an algorithm that
chooses sequential constraints to be included in the matrix. We will leave this
task as an exercise to the reader.
Limiting the matrix size by removing sequential constraints that do not
pertain to the tested project is quite effective, but is not enough if the tested
project is very large. Therefore, we exploit the minimum support parameter: If
a sequential constraint occurs less than minimum support times in the matrix,
it cannot be part of any specification (and thus neither of any violation). So
we can count the number of times each sequential constraint occurs in the
matrix and remove those that are too rare to be of use. We can do this step as
 
Search WWH ::




Custom Search