Lightweight Mining of Object Usage - Mining Software Specifications: Methodologies and Applications

Databases Reference

In-Depth Information

by applying formal concept analysis, and its input is just a matrix showing

which sequential constraints occur in which functions. Therefore, we can put

into the matrix the data for functions that stem from different projects. Our

idea is to have a set of reference projects to mine specifications from, and one

tested project to check for violations of those specifications.

Unfortunately, there is one fundamental scalability problem that we need

to solve to put the idea above into use: If we put data from all the reference

projects and the tested project into the matrix, it can become huge. There

are two reasons for this: First, there can be a lot of reference projects { and

therefore a lot of rows in the matrix. Second, the reference projects will differ

one from another in the functions that are called in them. This will result in

a huge variety of possible sequential constraints { meaning a lot of columns

in the matrix. Formal concept analysis will most likely choke on such a huge

matrix and this would make the whole idea impractical. How can we scale up

our approach?

Recall that our goal is finding violations of mined specifications in the

tested project, in the hope that these are defects. In other words, if a specifi-

cation cannot possibly be violated in the tested project, it might just as well

not have been mined at all. This leads to an important observation: If not

all specifications stemming from reference projects are useful, then also not

all sequential constraints stemming from the reference projects are useful. If

reference projects use a certain library that is not used in the tested project,

all sequential constraints related to this library can be safely ignored. This will

reduce the number of columns in the matrix, and might even reduce the num-

ber of rows, too, if it so happens that some reference projects are completely

unrelated to the tested project (i.e., use completely different APIs). It turns

out that this (coupled with increasing minimum support; see Section 12.5) is

enough to make the idea above scale and work very well in practice.

How do we decide which sequential constraints to keep and which to re-

move? We limit ourselves to sequential constraints that pertain to the tested

project.

Definition 12.8.1. A sequential constraint e 1 e 2 pertains to the tested

project iff there exists in the tested project a function, whose sequential con-

straints abstraction contains e 1 e or e e 2 , where e is an arbitrary event.

This definition can be straightforwardly translated into an algorithm that

chooses sequential constraints to be included in the matrix. We will leave this

task as an exercise to the reader.

Limiting the matrix size by removing sequential constraints that do not

pertain to the tested project is quite effective, but is not enough if the tested

project is very large. Therefore, we exploit the minimum support parameter: If

a sequential constraint occurs less than minimum support times in the matrix,

it cannot be part of any specification (and thus neither of any violation). So

we can count the number of times each sequential constraint occurs in the

matrix and remove those that are too rare to be of use. We can do this step as

Mining Software Specifications: Methodologies and Applications

Search WWH ::

Custom Search

Home