Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability - Trustworthy Eternal Systems via Evolving, Software Data and Knowledge

Information Technology Reference

In-Depth Information

use a sux-tree matching algorithm to compute token-by-token matching among

source code fragments. The authors adopt optimization techniques that mainly

normalize token sequences. This is due to the fact that the underlying algorithm

may be expensive when used on large software systems. The main drawback of

these approaches is that they completely disregard the syntactic structure of the

analyzed source code, similarly to textual based techniques. As a consequence,

these solutions may detect a large number of false clones, usually not correspond-

ing to any actual syntactic unit.

Syntactic Based Approaches: Syntactic based approaches exploit the in-

formation provided by Abstract Syntax Trees (AST) to identify similar code

fragments. Such techniques are more robust to modifications in code fragments

than textual and token based technique. However, they may possibly fail in case

modifications concerns the inversion or the substitution of entire code blocks:

the so-called gapped-clones [28]. Yang [48] uses dynamic programming to find

differences between two versions of the same source file. A similar approach is

presented by Baxter et al. [3]. It is based on a tree matching algorithm to com-

pare sub-trees of an AST of a given software system. On the other hand, Koschke

et al. [27] describe an approach to detect clones based on sux trees of serialized

ASTs. The main contribution of this work is that software clones can be identified

in linear time and space. A different approach is presented by Bulychev et al. [6],

where authors propose a clone detection technique based on the anti-unification

algorithm, widely used in Natural Language processing tasks. A novel approach

for detecting similar trees has been presented by Jiang et al. [21] in their tool

Deckard . In their approach, certain characteristic vectors are computed to ap-

proximate the structure of ASTs in a Euclidean space. Locality sensitive hashing

(LSH) is then used to cluster similar vectors using the Euclidean distance metric.

Structural Based Approaches: Structural based approaches gather informa-

tion from control and dependency graphs to identify clones. In particular these

techniques apply algorithms to identify isomorphic sub-graphs within a graph

built considering control and data flow dependencies (i.e., the program depen-

dence graphs, PDG) of the software system to analyze. Komondoor and Horwitz

[25] propose an approach based on program slicing techniques, applied on PDGs.

On the other hand, Krinke [28] propose a heuristic based approach to identify iso-

morphic sub-graphs. More recently, Gabel et al. [17] propose a PDG-based tech-

nique that maps slices of PDGs to syntax subtrees and applies the Deckard clone

detection tool [21]. The main advantage of these techniques is that they do not de-

pend on the particular textual representation of the code, allowing to detect also

functional duplications, in addition to the textual based ones considered by previ-

ous approaches. However the identification of isomorphic sub-graphs is a NP-hard

problem and only approximated solutions may be provided.

Combined Approaches: In the literature techniques that combine different

artifacts representation have been defined. For example, Leitao [32] combines

Search WWH ::

Custom Search

Home