Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability - Trustworthy Eternal Systems via Evolving, Software Data and Knowledge

provide models and views representing the relationships among different software

components according to a particular set of concerns [13]. However, unlike classes

or packages, this information does not have an explicit representation in the source

code. Moreover, the external documentation is usually either missing or out of date.

Therefore, the existing code remains the most up-to-date source of information

to exploit in order to automatically retrieve and reconstruct the architecture of

a software system. Several approaches have been proposed in the literature to

support this task, known as Software Architecture Recovery (SAR) [13]. Many

of these techniques derive architectural views of the subject system from the

source code by applying some clustering analysis techniques to software artifacts,

considered at different levels of granularity (e.g., at the class level) [13]. In this

scenario, one of the challenges is to define a proper similarity measure among

software artifacts in order to exploit their representation and to group together

the most related ones.
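
To make the role of the similarity measure concrete, the following is a minimal sketch, not the approach of any cited SAR technique. It assumes a purely lexical representation (a bag of identifier tokens per class), cosine similarity, and a simple threshold-based single-linkage grouping. The class names, the source snippets and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(source):
    """Split source text into lower-cased word tokens (camelCase aware)."""
    parts = []
    for word in re.findall(r"[A-Za-z]+", source):
        parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word))
    return Counter(p.lower() for p in parts)


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(artifacts, threshold=0.3):
    """Group artifacts whose similarity reaches the (assumed) threshold.

    Single-linkage grouping implemented with a small union-find structure.
    """
    names = list(artifacts)
    vectors = {n: tokens(artifacts[n]) for n in names}
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical classes, used only for illustration.
artifacts = {
    "OrderRepository": "class OrderRepository { void saveOrder(Order order) {} Order loadOrder(int id) {} }",
    "OrderService": "class OrderService { void placeOrder(Order order) { repository.saveOrder(order); } }",
    "HtmlRenderer": "class HtmlRenderer { String renderPage(Page page) {} }",
    "PageTemplate": "class PageTemplate { String renderPage(Page page, Theme theme) {} }",
}

clusters = cluster(artifacts)
print(clusters)
```

On this toy input the two order-related classes end up in one group and the two page-related classes in another. Replacing `cosine` with a different measure changes the grouping without touching the clustering step, which is why the choice of similarity is the central design decision.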

Another well-known and widely investigated issue in software maintenance is

clone detection, which focuses on the identification of duplicated source code.

Software clones might affect the reliability and the maintainability of large software

systems. For example, errors affecting a fragment of code must be fixed

in every one of its duplicates. In general, duplication in source code

is a phenomenon that occurs frequently in large software systems [3]. Reasons

why programmers duplicate code are manifold. The most well known is a common

bad programming practice, i.e., copying and pasting [40], that gives rise to

software clones, or simply clones. However, in addition to simply copying and

pasting fragments of code, programmers usually adapt software copies to the new

context by applying multiple modifications (e.g., adding new statements and renaming

variables). Thus, similarly to SAR, the computation of the similarity

between source code fragments becomes crucial [40].
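
The effect of such modifications on a similarity measure can be illustrated with a minimal sketch. It assumes a simplified token-based comparison rather than the technique proposed in this paper: identifiers and numeric literals are replaced by placeholders, and the resulting token sequences are compared with Python's standard `difflib.SequenceMatcher`. The keyword set and the code fragments are hypothetical.

```python
import difflib
import re

# Small, assumed keyword set for a C-like language; not exhaustive.
KEYWORDS = {"int", "for", "if", "else", "while", "return"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalise(fragment):
    """Map identifiers to ID and numbers to LIT; keep keywords and symbols."""
    out = []
    for tok in TOKEN.findall(fragment):
        if tok in KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        elif tok.isdigit():
            out.append("LIT")
        else:
            out.append(tok)
    return out


def clone_similarity(a, b):
    """Similarity in [0, 1] between two normalised token sequences."""
    matcher = difflib.SequenceMatcher(
        None, normalise(a), normalise(b), autojunk=False
    )
    return matcher.ratio()


original = "int sum = 0; for (int i = 0; i < n; i++) { sum += a[i]; } return sum;"
# Same fragment with every variable renamed.
renamed = "int total = 0; for (int k = 0; k < size; k++) { total += values[k]; } return total;"
# Renamed fragment with one added statement.
edited = "int total = 0; for (int k = 0; k < size; k++) { total += values[k]; count++; } return total;"
unrelated = "while (x) { x = next(x); }"

print(clone_similarity(original, renamed))
print(clone_similarity(original, edited))
print(clone_similarity(original, unrelated))
```

Because renaming disappears after normalisation, the renamed copy is indistinguishable from the original, while the copy with an added statement scores slightly lower and the unrelated fragment much lower. A purely textual comparison would penalise the renamed copy, which is one reason structure-aware similarity measures are of interest.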

In this paper, we explore the possibility of combining different methods gathered

from Information Retrieval (IR), Natural Language Processing (NLP) and

Machine Learning (ML) fields to automatically mine information from the source

code for the identification of clones and the recovery of the actually implemented

architecture of a subject software system. In particular, we investigate the application

of Kernel Methods [20,37] to define similarity measures able to exploit the

structural representation of the source code. We used these techniques in the

fields of architecture reconstruction and clone detection because they provide

flexible solutions able to analyze large data sets with reasonable computational

requirements.
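
As an illustration of what a kernel over a structural representation can look like, the sketch below defines a bag-of-productions kernel on abstract syntax trees. This is a deliberately simplified stand-in and is not claimed to be the kernel used in this paper: each tree is mapped to counts of depth-one fragments (a node type together with the types of its children), and the kernel is the dot product of those counts. For self-containment it assumes Python source parsed with the standard `ast` module; the function names and code fragments are hypothetical.

```python
import ast
import math
from collections import Counter


def productions(source):
    """Count depth-one tree fragments: node type plus its child node types."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        children = tuple(type(c).__name__ for c in ast.iter_child_nodes(node))
        if children:
            counts[(type(node).__name__, children)] += 1
    return counts


def kernel(a, b):
    """Dot product of production counts; a valid kernel by construction,
    since it is an inner product of explicit feature vectors."""
    fa, fb = productions(a), productions(b)
    return sum(fa[p] * fb[p] for p in fa.keys() & fb.keys())


def normalised_kernel(a, b):
    """Kernel value rescaled to [0, 1] so fragments of different size compare."""
    denom = math.sqrt(kernel(a, a) * kernel(b, b))
    return kernel(a, b) / denom if denom else 0.0


f1 = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
# Same structure as f1, different identifiers.
f2 = "def acc(values):\n    result = 0\n    for v in values:\n        result += v\n    return result\n"
# Different structure.
f3 = "def first(xs):\n    if xs:\n        return xs[0]\n    return None\n"

print(normalised_kernel(f1, f2))
print(normalised_kernel(f1, f3))
```

Identifier names do not appear in the feature map, so `f1` and `f2` are treated as structurally identical, whereas `f3` shares only a few productions with them. Richer tree kernels count larger shared fragments than the depth-one ones used here.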

Paper Structure. In Sections 2 and 3, we provide an extensive review of the state of the art

for SAR and clone detection techniques, respectively. In Section 4, we illustrate

our proposal for clone detection. Section 5 describes the case study used in the

evaluation procedure, whose results are reported in Section 6. Plans for applying

kernel machines and advanced structured-output learning approaches for SAR

are discussed in Section 7. Finally, conclusions are drawn in Section 8.
