Information Technology Reference
In-Depth Information
use a sux-tree matching algorithm to compute token-by-token matching among
source code fragments. The authors adopt optimization techniques that mainly
normalize token sequences. This is due to the fact that the underlying algorithm
may be expensive when used on large software systems. The main drawback of
these approaches is that they completely disregard the syntactic structure of the
analyzed source code, similarly to textual based techniques. As a consequence,
these solutions may detect a large number of false clones, usually not correspond-
ing to any actual syntactic unit.
Syntactic Based Approaches: Syntactic based approaches exploit the in-
formation provided by Abstract Syntax Trees (AST) to identify similar code
fragments. Such techniques are more robust to modifications in code fragments
than textual and token based technique. However, they may possibly fail in case
modifications concerns the inversion or the substitution of entire code blocks:
the so-called gapped-clones [28]. Yang [48] uses dynamic programming to find
differences between two versions of the same source file. A similar approach is
presented by Baxter et al. [3]. It is based on a tree matching algorithm to com-
pare sub-trees of an AST of a given software system. On the other hand, Koschke
et al. [27] describe an approach to detect clones based on sux trees of serialized
ASTs. The main contribution of this work is that software clones can be identified
in linear time and space. A different approach is presented by Bulychev et al. [6],
where authors propose a clone detection technique based on the anti-unification
algorithm, widely used in Natural Language processing tasks. A novel approach
for detecting similar trees has been presented by Jiang et al. [21] in their tool
Deckard . In their approach, certain characteristic vectors are computed to ap-
proximate the structure of ASTs in a Euclidean space. Locality sensitive hashing
(LSH) is then used to cluster similar vectors using the Euclidean distance metric.
Structural Based Approaches: Structural based approaches gather informa-
tion from control and dependency graphs to identify clones. In particular these
techniques apply algorithms to identify isomorphic sub-graphs within a graph
built considering control and data flow dependencies (i.e., the program depen-
dence graphs, PDG) of the software system to analyze. Komondoor and Horwitz
[25] propose an approach based on program slicing techniques, applied on PDGs.
On the other hand, Krinke [28] propose a heuristic based approach to identify iso-
morphic sub-graphs. More recently, Gabel et al. [17] propose a PDG-based tech-
nique that maps slices of PDGs to syntax subtrees and applies the Deckard clone
detection tool [21]. The main advantage of these techniques is that they do not de-
pend on the particular textual representation of the code, allowing to detect also
functional duplications, in addition to the textual based ones considered by previ-
ous approaches. However the identification of isomorphic sub-graphs is a NP-hard
problem and only approximated solutions may be provided.
Combined Approaches: In the literature techniques that combine different
artifacts representation have been defined. For example, Leitao [32] combines
 
Search WWH ::




Custom Search