proposed by Corazza et al. [9]. In particular, since syntactic information alone
is not sufficient to decide whether two code fragments are clones, the
authors proposed to enrich the information present in each internal node of
the AST by annotating it with the lexemes gathered from the corresponding
leaf nodes. The authors reported a formal definition of such a Tree Kernel function
for clone detection, together with its experimental evaluation. Results showed
that the Tree Kernel approach applied to the annotated ASTs outperforms
a purely AST-based technique in detecting clones up to Type 3 [9].
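The annotation step can be sketched in Python with the standard `ast` module; note that the lexeme categories collected below (identifiers, constants, function names) are our own illustrative choice, not necessarily those used in [9]:

```python
import ast

def annotate_with_lexemes(tree):
    """Attach to every AST node the set of lexemes gathered
    from the leaves of its subtree."""
    for node in ast.walk(tree):
        lexemes = set()
        for child in ast.walk(node):  # includes node itself
            if isinstance(child, ast.Name):
                lexemes.add(child.id)
            elif isinstance(child, ast.Constant):
                lexemes.add(repr(child.value))
            elif isinstance(child, ast.FunctionDef):
                lexemes.add(child.name)
        node.lexemes = lexemes
    return tree

tree = annotate_with_lexemes(ast.parse("total = price * qty"))
print(tree.body[0].lexemes)  # lexemes collected at the assignment node
```

Two annotated subtrees can then be compared by a Tree Kernel that matches both node types and the attached lexeme sets.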
The main drawback of this approach is that it cannot be effectively applied
to the identification of Type 4 clones, as the notion of similarity it embeds
mainly considers the program text of the compared code fragments. Conversely,
to deal with Type 4 clones, the information about the program behavior becomes
particularly relevant. As a matter of fact, most of the clone detection techniques
that are able to detect Type 4 clones [28,25,17] use Program Dependency Graphs
(PDGs) to represent the source code. In a PDG, nodes correspond to simple
statements and control flow predicates, and edges encode data and control de-
pendencies [17].
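As an illustration, the PDG of a three-line fragment could be hand-encoded as follows (a toy sketch; the node numbering and edge labels are our own):

```python
# Hand-built PDG for the fragment:
#   1: x = input()
#   2: if x > 0:
#   3:     y = x + 1
pdg_nodes = {1: "x = input()", 2: "x > 0", 3: "y = x + 1"}
pdg_edges = [
    (1, 2, "data"),     # x defined at 1 is used in the predicate at 2
    (1, 3, "data"),     # x defined at 1 is used at 3
    (2, 3, "control"),  # 3 executes only if the predicate at 2 holds
]
```

Clone detection on PDGs then reduces to finding (approximately) isomorphic subgraphs, which is what makes these techniques robust to reordered but semantically equivalent statements.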
To this aim, we are currently investigating the application of Graph
Kernels to PDGs in order to detect meaningfully similar subgraphs. However, the main
limitation of such approaches is the computational effort they require,
which is much larger than that of Tree Kernels. Thus, to
find a good trade-off between this cost and the information considered in the
comparison of (sub)graphs, we are focusing on the application of Weighted Decomposition
Kernels (WDK) [35], as they make it possible to define specific criteria that reduce
the total number of comparisons.
4.2 Automatic Generation of Training Data
A typical problem in developing Machine Learning approaches is the need
to arrange two different sets of annotated data, namely the training set and
the assessment set. This problem is particularly important in the case of clone
detection, where these labeled data sets are even more difficult than usual to
produce, since the manual annotation of large systems is infeasible. The generally
adopted solution to this problem is the definition of a pooling process,
such as the one described in [4], where a limited set of results, gathered from
different clone detection tools, is manually cross-checked. However, the effect of
such a procedure is that there is no guarantee of completeness, and so only an
underestimate of the actual performance can be provided. In addition,
these data are not very effective for training new algorithms, since they are
biased towards the clones detected by the tools used to produce the data set.
In this scenario, only unsupervised Machine Learning, i.e., clustering, can be
proposed. However, clustering cannot be expected to be accurate enough for this
application, as the learning process is guided only by the similarity computation.
As an alternative, we propose to construct the training data by simulation, and
then to use them to train a classifier for clone detection. The simulation process
artificially produces clones by modifying parts of the input project and injects
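One minimal injection scheme along these lines could look as follows (a hypothetical sketch: we apply a single Type-2-style mutation, systematic identifier renaming at the token level, whereas a real simulator would cover a broader catalogue of edits):

```python
import random

def inject_clone(functions, rng=None):
    """Pick a function body, rename its identifiers systematically, and
    return the (original, mutant) pair as a labelled positive example."""
    rng = rng or random.Random(0)
    name, body = rng.choice(sorted(functions.items()))
    renames = {}
    tokens = []
    for tok in body.split():
        # Rename identifiers consistently; keep keywords untouched.
        if tok.isidentifier() and tok not in ("def", "return"):
            renames.setdefault(tok, f"v{len(renames)}")
            tok = renames[tok]
        tokens.append(tok)
    return name, body, " ".join(tokens)

name, original, clone = inject_clone({"add": "return a + b"})
print(original, "->", clone)
```

Because each mutation is applied deliberately, every injected pair is a ground-truth clone by construction, sidestepping the bias of tool-pooled data sets.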