syntactic and semantic techniques using functions that consider various aspects
of software systems (e.g., similar call sub-graphs, commutative operators, user-defined
equivalences). In contrast, Wahler et al. [45] present an approach based
on a data mining technique to detect clones. This approach uses the concept
of frequent item-sets on the XML representation of the software system to be
analyzed. Finally, Roy and Cordy [39] present an approach based on source
transformations and text line comparison to find clones.
4 Clone Detection
As briefly introduced in the previous section, the definition of clones [3] states
that two code fragments form a clone if they are “similar” according to some
definition of similarity. Such similarity can be based on the program
text, on the implemented functionality (independent of the text), or on both.
In the literature, all these kinds of code similarities correspond to the following
taxonomy of clones [40]:
Type 1: An exact copy of consecutive code fragments without modifications
(except for white space and comments).
Type 2: Syntactically identical fragments except for variations in identifiers,
literals, and variable types, in addition to Type 1 variations.
Type 3: Copied fragments with further modifications such as changed, added,
or deleted statements, in addition to Type 2 variations.
Type 4: Code fragments that perform similar functionality but are implemented
by different syntactic variants.
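To make the first three types concrete, the following is a minimal sketch in Python. It is not taken from the approaches cited here. It assumes that fragments are compared as token sequences produced by the standard `tokenize` module. The helper `normalized_tokens` and the sample fragments are hypothetical names introduced only for this illustration.

```python
import difflib
import io
import keyword
import tokenize


def normalized_tokens(source, abstract_names=False):
    """Return the token strings of `source`, ignoring comments and blank lines.

    With abstract_names=True, identifiers and literals are replaced by
    placeholders, which hides the variations permitted in Type 2 clones.
    """
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # layout and comments do not matter for any clone type
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            # Keep block structure, but not the exact whitespace used.
            result.append(tokenize.tok_name[tok.type])
        elif (abstract_names and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            result.append("ID")
        elif abstract_names and tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append("LIT")
        else:
            result.append(tok.string)
    return result


# Hypothetical fragments, one per clone type.
original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
type1 = ("def total(xs):\n    s = 0   # running sum\n\n"
         "    for x in xs:\n        s += x\n    return s\n")
type2 = ("def add_all(values):\n    acc = 0\n"
         "    for v in values:\n        acc += v\n    return acc\n")
type3 = ("def add_all(values):\n    acc = 0\n"
         "    for v in values:\n        acc += v\n        print(v)\n    return acc\n")

# Type 1: identical once comments and white space are ignored.
is_type1 = normalized_tokens(original) == normalized_tokens(type1)

# Type 2: identical only after identifiers and literals are abstracted.
is_type2 = (normalized_tokens(original) != normalized_tokens(type2)
            and normalized_tokens(original, True) == normalized_tokens(type2, True))

# Type 3: an added statement breaks exact equality, so a graded
# similarity score over the abstracted tokens is needed instead.
type3_similarity = difflib.SequenceMatcher(
    None, normalized_tokens(original, True), normalized_tokens(type3, True)
).ratio()
```

Type 4 clones are deliberately left out of the sketch: two fragments that compute the same result with different syntax (for example, a loop versus a call to the built-in `sum`) share few tokens, so no purely lexical comparison of this kind can be expected to relate them.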
According to this classification, only Type 1 clones are represented by exactly
the same set of instructions, while the other three types involve lexical and
syntactic variations between the two fragments. As a consequence, an effective
similarity measure has to combine both the syntactic and lexical information
in order to produce a correct solution. Therefore, the input representation is
the first crucial point to consider when designing a Machine Learning-based
clone detector. In addition, annotated data are needed to train the considered
techniques. These two points will be discussed in depth in the remainder of this
section.
4.1 Code Similarities and Kernel Methods
Kernel Methods [20] have been shown to be effective in approaches that consider the
similarity between complex input structures. In particular, Tree Kernels have
been widely used in fields where the information is represented by means of
tree-based structures, such as Natural Language Processing [37] and Bioinformatics
[44], where they have been applied to parse trees and phylogenetic trees, respectively.
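As an illustration of what a Tree Kernel computes, the following is a minimal sketch of a subtree kernel, one of the simplest members of this family. It is not necessarily the kernel used in the cited works. It assumes that a tree is encoded as a nested tuple `(label, child, ...)`, and the function names are hypothetical. The kernel value is the inner product of the two trees' subtree-count vectors, that is, the number of pairs of identical complete subtrees.

```python
import math
from collections import Counter


def subtree_counts(tree, counts=None):
    """Count every complete subtree of `tree`, encoded as (label, child, ...)."""
    if counts is None:
        counts = Counter()
    counts[tree] += 1          # nested tuples are hashable, so they can be keys
    for child in tree[1:]:
        subtree_counts(child, counts)
    return counts


def subtree_kernel(t1, t2):
    """Inner product of the subtree-count vectors of the two trees."""
    c1, c2 = subtree_counts(t1), subtree_counts(t2)
    return sum(n * c2[s] for s, n in c1.items())


def normalized_kernel(t1, t2):
    """Scale the kernel to [0, 1] so that trees of different sizes are comparable."""
    return subtree_kernel(t1, t2) / math.sqrt(
        subtree_kernel(t1, t1) * subtree_kernel(t2, t2)
    )


# Two small parse trees that differ in a single leaf ("cat" vs. "dog").
tree_a = ("S",
          ("NP", ("D", ("the",)), ("N", ("cat",))),
          ("VP", ("V", ("sleeps",))))
tree_b = ("S",
          ("NP", ("D", ("the",)), ("N", ("dog",))),
          ("VP", ("V", ("sleeps",))))

shared = subtree_kernel(tree_a, tree_b)         # 5 shared complete subtrees
similarity = normalized_kernel(tree_a, tree_b)  # 5 / 9
```

Changing a single leaf removes every subtree that contains that leaf from the shared set, which is why kernels that also count partial tree fragments are often preferred when small edits should have a small effect on the score.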
When dealing with the clone detection problem, an interesting solution could be
to apply Tree Kernels to Abstract Syntax Trees (ASTs) of the source code, as
 