proposed by Corazza et al. [9]. In particular, since syntactic information alone
is not sufficient to decide whether two code fragments are clones, the
authors proposed to enrich the information present in each internal node of
the AST by annotating it with the lexemes gathered from the corresponding
leaf nodes. The authors reported a formal definition of such a Tree Kernel function
for clone detection, together with its experimental evaluation. Results showed
that the Tree Kernel approach applied to the annotated ASTs outperforms
a purely AST-based technique in detecting clones up to Type 3 [9].
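The annotation step can be sketched in Python with the standard `ast` module; note that the lexeme categories collected below (identifiers, constants, function names) are our own illustrative choice, not necessarily those used in [9]:

```python
import ast

def annotate_with_lexemes(tree):
    """Attach to every AST node the set of lexemes gathered
    from the leaves of its subtree."""
    for node in ast.walk(tree):
        lexemes = set()
        for child in ast.walk(node):  # includes node itself
            if isinstance(child, ast.Name):
                lexemes.add(child.id)
            elif isinstance(child, ast.Constant):
                lexemes.add(repr(child.value))
            elif isinstance(child, ast.FunctionDef):
                lexemes.add(child.name)
        node.lexemes = lexemes
    return tree

tree = annotate_with_lexemes(ast.parse("total = price * qty"))
print(tree.body[0].lexemes)  # lexemes collected at the assignment node
```

Two annotated subtrees can then be compared by a Tree Kernel that matches both node types and the attached lexeme sets.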
The main drawback of this approach is that it cannot be effectively applied
to the identification of Type 4 clones, as the notion of similarity it embeds
mainly considers the program text of the compared code fragments. Conversely,
to deal with Type 4 clones, the information about the program behavior becomes
particularly relevant. As a matter of fact, most of the clone detection techniques
that are able to detect Type 4 clones [28,25,17] use Program Dependency Graphs
(PDGs) to represent the source code. In a PDG, nodes correspond to simple
statements and control flow predicates, and edges encode data and control de-
pendencies [17].
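As an illustration, the PDG of a three-line fragment could be hand-encoded as follows (a toy sketch; the node numbering and edge labels are our own):

```python
# Hand-built PDG for the fragment:
#   1: x = input()
#   2: if x > 0:
#   3:     y = x + 1
pdg_nodes = {1: "x = input()", 2: "x > 0", 3: "y = x + 1"}
pdg_edges = [
    (1, 2, "data"),     # x defined at 1 is used in the predicate at 2
    (1, 3, "data"),     # x defined at 1 is used at 3
    (2, 3, "control"),  # 3 executes only if the predicate at 2 holds
]
```

Clone detection on PDGs then reduces to finding (approximately) isomorphic subgraphs, which is what makes these techniques robust to reordered but semantically equivalent statements.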
To this aim, we are currently investigating the application of Graph
Kernels to PDGs in order to detect meaningfully similar subgraphs. However, the main
limitation of such approaches is the computational effort they require,
which is much larger than that of Tree Kernels. Thus, to
find a good trade-off between this cost and the information considered in the
comparison of (sub)graphs, we are focusing on the application of Weighted Decomposition
Kernels (WDK) [35], as they make it possible to define specific criteria that reduce
the total number of comparisons.
4.2 Automatic Generation of Training Data
A typical problem in developing Machine Learning approaches is the need
to arrange two different sets of annotated data, namely the training set and
the assessment set. This problem is particularly important in the case of clone
detection, where these labeled data sets are even more difficult than usual to
produce, since the manual annotation of large systems is infeasible. The generally
adopted solution to this problem is the definition of a pooling process,
such as the one described in [4], where a limited set of results, gathered from
different clone detection tools, is manually cross-checked. However, the effect of
such a procedure is that there is no guarantee of completeness, and so only an
underestimate of the actual performance can be provided. In addition,
these data are not very effective for training new algorithms, since they are
biased towards the clones detected by the tools used to produce the data set.
In this scenario, only unsupervised Machine Learning, i.e., clustering, can be
proposed. However, clustering cannot be expected to be accurate enough for this
application, as the learning process is guided only by the similarity computation.
As an alternative, we propose to construct the training data by simulation, and
then to use them to train a classifier for clone detection. The simulation process
artificially produces clones by modifying parts of the input project and injects
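One minimal injection scheme along these lines could look as follows (a hypothetical sketch: we apply a single Type-2-style mutation, systematic identifier renaming at the token level, whereas a real simulator would cover a broader catalogue of edits):

```python
import random

def inject_clone(functions, rng=None):
    """Pick a function body, rename its identifiers systematically, and
    return the (original, mutant) pair as a labelled positive example."""
    rng = rng or random.Random(0)
    name, body = rng.choice(sorted(functions.items()))
    renames = {}
    tokens = []
    for tok in body.split():
        # Rename identifiers consistently; keep keywords untouched.
        if tok.isidentifier() and tok not in ("def", "return"):
            renames.setdefault(tok, f"v{len(renames)}")
            tok = renames[tok]
        tokens.append(tok)
    return name, body, " ".join(tokens)

name, original, clone = inject_clone({"add": "return a + b"})
print(original, "->", clone)
```

Because each mutation is applied deliberately, every injected pair is a ground-truth clone by construction, sidestepping the bias of tool-pooled data sets.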