Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability - Trustworthy Eternal Systems via Evolving, Software Data and Knowledge - page 122


3 State-of-the-Art of Clone Detection Techniques

In this section we summarize research in the area of clone detection, grouping the proposals according to the features they exploit to identify similarities among software artifacts. Note that our goal here is not to provide an extensive analysis of the clone detection approaches presented in the literature, but to give an overview of the most important techniques together with the general background on the problem that is needed to introduce the proposal presented in Section 4. An exhaustive survey of clone detection tools and techniques is provided in [40].

Table 2. Overview of clone detection techniques

Approach                      Used Information   Technique
Ducasse et al. [14]           Textual            String matching
Johnson [22]                  Textual            String matching
Baker [2]                     Token              Pattern matching
Kamiya et al. [23]            Token              Suffix-tree matching
Yang [48]                     Syntactic          Dynamic Programming
Baxter et al. [3]             Syntactic          Tree Matching
Koschke et al. [27]           Syntactic          Suffix-tree AST
Bulychev et al. [6]           Syntactic          Anti-unification (NLP)
Jiang et al. [21]             Syntactic          LSH
Komondoor and Horwitz [25]    Structural         PDG Slicing
Krinke [28]                   Structural         PDG Heuristics
Gabel et al. [17]             Structural         PDG Slicing
Leitao [32]                   Combined           Software metrics
Wahler et al. [45]            Combined           Frequent Item-sets
Corazza et al. [9]            Combined           Tree Kernels (ML)
Roy and Cordy [39]            Combined           Code Transformation

Textual Based Approaches: Ducasse et al. [14] propose a language-independent approach to detect code clones, based on line-based string matching and a visual presentation of the cloned code. A different approach is presented by Johnson [22], who applies a string matching technique based on fingerprints to identify exact repetitions of text in the source code of large software systems.
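The idea shared by these textual approaches can be illustrated with a minimal Python sketch. It is not the implementation of [14] or [22]: the whitespace normalization, the fixed window of three lines, the SHA-1 fingerprint, and the names normalize and find_line_clones are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def normalize(line: str) -> str:
    """Remove all whitespace so that layout changes do not hide a match
    (an assumed normalization, chosen for this sketch)."""
    return "".join(line.split())


def find_line_clones(source: str, window: int = 3):
    """Return groups of 1-based start lines whose `window` consecutive
    non-blank lines are textually identical after normalization."""
    # Keep (original line number, normalized text) for non-blank lines only.
    lines = [
        (number, normalize(text))
        for number, text in enumerate(source.splitlines(), start=1)
        if text.strip()
    ]
    buckets = defaultdict(list)
    for k in range(len(lines) - window + 1):
        chunk = "\n".join(text for _, text in lines[k:k + window])
        # Fingerprint the window; equal fingerprints mark candidate clones.
        fingerprint = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
        buckets[fingerprint].append(lines[k][0])
    return sorted(starts for starts in buckets.values() if len(starts) > 1)


if __name__ == "__main__":
    sample = (
        "a = 1\nb = 2\nc = a + b\nprint(c)\n\n"
        "a = 1\nb = 2\nc = a + b\nprint(b)\n"
    )
    print(find_line_clones(sample))
```

Because the comparison is purely textual, a copy in which identifiers have been renamed produces different fingerprints and goes unreported.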

The main strength of these techniques lies in their efficiency and scalability, which make them easily applicable to the analysis of large software systems. However, their detection capabilities are very limited, being restricted to very similar textual duplications (line by line). As a matter of fact, these approaches are scarcely usable in practice.

Token Based Approaches: Baker [2] suggests an approach to identify duplications and near-duplications (i.e., copies with slight modifications) in large software systems. The proposed approach finds source code copies that are substantially the same except for global substitutions. Similarly, Kamiya et al. [23]
