Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability - Trustworthy Eternal Systems via Evolving, Software Data and Knowledge

syntactic and semantic techniques using functions that consider various aspects

of software systems (e.g., similar call sub-graphs, commutative operators, user-

defined equivalences). In contrast, Wahler et al. [45] present an approach based

on a data mining technique to detect clones. This approach uses the concept

of frequent item-sets on the XML representation of the software system to be

analyzed. Finally, Roy and Cordy [39] present an approach based on source

transformations and text line comparison to find clones.

4 Clone Detection

As briefly introduced in the previous section, the definition of clones [3] states

that two code fragments form a clone if they are “similar” according to some

definition of similarity. However, such similarity can be based on the program

text, on the implemented functionality (independent of the text), or on both.

In the literature, all these kinds of code similarities correspond to the following

taxonomy of clones [40]:

Type 1 : An exact copy of consecutive code fragments without modifications

(except for white spaces and comments).

Type 2 : Syntactically identical fragments except for variations in identifiers,

literals, and variable types in addition to Type-1 variations.

Type 3 : Copied fragments with further modifications such as changed, added,

or deleted statements in addition to Type-2 variations.

Type 4 : Code fragments that perform similar functionality but are imple-

mented by different syntactic variants.
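To make the taxonomy concrete, the following minimal sketch compares token sequences of small Python fragments. It is only an illustration under stated assumptions and not one of the detection approaches discussed in this chapter: the helper names (tokens, clone_type) and the sample fragments are hypothetical, and a purely lexical comparison of this kind can separate Type 1 and Type 2 clones but cannot tell Type 3 and Type 4 clones apart from unrelated code.

```python
import io
import keyword
import tokenize

def tokens(src, abstract=False):
    """Token texts of `src`, ignoring comments, blank lines and indent width.

    With abstract=True, identifiers and literals are replaced by placeholders
    (an assumed, deliberately naive normalisation for Type 2 comparison).
    """
    layout = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}
    skipped = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type in skipped:
            continue
        if tok.type in layout:
            # Keep block structure, but not the amount of white space.
            out.append(tokenize.tok_name[tok.type])
        elif abstract and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif abstract and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        else:
            out.append(tok.string)
    return out

def clone_type(a, b):
    """Classify a pair of fragments as far as token comparison allows."""
    if tokens(a) == tokens(b):
        return "Type 1"
    if tokens(a, abstract=True) == tokens(b, abstract=True):
        return "Type 2"
    return "Type 3, Type 4, or not a clone"

original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
# Same code with a comment, a blank line and different spacing.
copy_1 = "def total(xs):\n    s=0  # accumulator\n\n    for x in xs:\n        s += x\n    return s\n"
# Renamed identifiers and a different literal.
copy_2 = "def add_all(values):\n    acc = 0.0\n    for v in values:\n        acc += v\n    return acc\n"
# An added statement (a guard on the accumulated value).
copy_3 = "def total(xs):\n    s = 0\n    for x in xs:\n        if x > 0:\n            s += x\n    return s\n"
# Similar functionality, different syntax.
copy_4 = "def total(xs):\n    return sum(xs)\n"

for name, fragment in [("copy_1", copy_1), ("copy_2", copy_2),
                       ("copy_3", copy_3), ("copy_4", copy_4)]:
    print(name, "->", clone_type(original, fragment))
```

The last two pairs fall into the same catch-all category, which is exactly the limitation of purely lexical comparison: recognising Type 3 and Type 4 clones requires a similarity measure that also accounts for structure.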

According to this classification, only Type 1 clones are represented by exactly

the same set of instructions, while the other three types involve lexical and

syntactic variations between the two fragments. As a consequence, an effective

similarity measure has to combine both the syntactic and lexical information

in order to produce a correct solution. Therefore, the input representation is

the first crucial point to consider when designing a Machine Learning-based

clone detector. In addition, annotated data are needed to train the considered

techniques. These two points will be discussed in depth in the remainder of this

section.

4.1 Code Similarities and Kernel Methods

Kernel Methods [20] have been shown to be effective in approaches considering the

similarity between complex input structures. In particular, Tree Kernels have

been widely used in fields where the information is represented by means of

tree-based structures, like Natural Language Processing [37] and Bioinformatics

[44], where they have been applied to parse trees and phylogenetic trees, respectively.
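As a rough illustration of how a Tree Kernel scores the similarity of two trees, the sketch below counts the tree fragments that two small labelled trees have in common, following the widely used subset-tree formulation. It is an assumption made for illustration only: the text above does not state which kernel is adopted, and the tree encoding, the function names and the decay parameter are all hypothetical.

```python
from math import sqrt

# Assumed encoding: a tree is a (label, children) pair; a leaf has no children.

def nodes(tree):
    """Yield every node of the tree in pre-order."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def production(node):
    """Label of a node together with the labels of its children."""
    label, children = node
    return (label, tuple(child[0] for child in children))

def common_fragments(n1, n2, decay=1.0):
    """Weighted number of tree fragments rooted at both n1 and n2."""
    if not n1[1] or not n2[1]:
        return 0.0  # a leaf on its own is not counted as a fragment
    if production(n1) != production(n2):
        return 0.0
    score = decay
    # Equal productions imply the same number of children, so zip is safe.
    for c1, c2 in zip(n1[1], n2[1]):
        score *= 1.0 + common_fragments(c1, c2, decay)
    return score

def tree_kernel(t1, t2, decay=1.0):
    """Sum the common-fragment scores over all pairs of nodes."""
    return sum(common_fragments(a, b, decay)
               for a in nodes(t1) for b in nodes(t2))

def similarity(t1, t2, decay=1.0):
    """Kernel value normalised to [0, 1]; undefined for single-leaf trees."""
    norm = sqrt(tree_kernel(t1, t1, decay) * tree_kernel(t2, t2, decay))
    return tree_kernel(t1, t2, decay) / norm

# Two toy trees that differ only in one leaf.
t1 = ("Add", [("Name", [("x", [])]), ("Num", [("1", [])])])
t2 = ("Add", [("Name", [("x", [])]), ("Num", [("2", [])])])

print(tree_kernel(t1, t1))  # fragments of t1 shared with itself
print(tree_kernel(t1, t2))  # fragments shared by t1 and t2
print(similarity(t1, t2))
```

The kernel never enumerates fragments explicitly. The recursion over matching productions counts them implicitly, which is what makes this family of kernels practical on larger trees. A decay value below 1 would down-weight large fragments.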

When dealing with the clone detection problem, an interesting solution could be

to apply Tree Kernels to Abstract Syntax Trees (ASTs) of the source code, as
