provide models and views representing the relationships among different software
components according to a particular set of concerns [13]. However, unlike classes
or packages, this information does not have an explicit representation in the source
code. Moreover, external documentation is often missing or out of date.
Therefore, the existing code remains the most up-to-date source of information
to exploit in order to automatically retrieve and reconstruct the architecture of
a software system. Several approaches have been proposed in the literature to
support this task, known as Software Architecture Recovery (SAR) [13]. Many
of these techniques derive architectural views of the subject system from the
source code by applying some clustering analysis techniques to software artifacts,
considered at different levels of granularity (e.g., at the class level) [13]. In this
scenario, one of the challenges is to define a proper similarity measure among
software artifacts in order to exploit their representation and to group together
the most related ones.
Another well-known and widely investigated issue in software maintenance is
clone detection, which focuses on identifying duplicated source code.
Software clones may affect the reliability and maintainability of large software
systems. For example, an error affecting a fragment of code must be fixed
in every one of its duplicates. In general, source code duplication
is a phenomenon that occurs frequently in large software systems [3]. The reasons
why programmers duplicate code are manifold. The best known is a common
bad programming practice, namely copying and pasting [40], which gives rise to
software clones, or simply clones. However, in addition to simply copying and
pasting fragments of code, programmers usually adapt the copies to the new
context by applying multiple modifications (e.g., adding new statements and
renaming variables). Thus, similarly to SAR, computing the similarity
between source code fragments becomes crucial [40].
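To illustrate why similarity must tolerate such modifications, the following minimal Python sketch compares two fragments after normalizing identifiers and numeric literals. It is not the technique proposed in this paper: the token pattern, the `ID`/`NUM` placeholders, the 3-token shingle size and the function names are all assumptions chosen for illustration.

```python
import keyword
import re

# Assumed, deliberately crude tokenizer: identifiers, integers, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(code):
    """Tokenize code, replacing identifiers with ID and integers with NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if keyword.iskeyword(tok):
            tokens.append(tok)      # keep language keywords as they are
        elif re.match(r"[A-Za-z_]", tok):
            tokens.append("ID")     # a renamed variable maps to the same token
        elif tok.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)      # operators and punctuation
    return tokens


def shingles(tokens, n=3):
    """Return the set of n-token windows of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def clone_similarity(a, b, n=3):
    """Jaccard similarity between the shingle sets of two code fragments."""
    sa, sb = shingles(normalize(a), n), shingles(normalize(b), n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = "total = total + price * 2"
renamed = "acc = acc + cost * 7"
print(clone_similarity(original, renamed))
```

Under these assumptions a copy whose variables were renamed scores as identical to the original, while inserted statements lower the score only in proportion to the windows they disturb.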
In this paper, we explore the possibility of combining different methods
gathered from the Information Retrieval (IR), Natural Language Processing (NLP) and
Machine Learning (ML) fields to automatically mine information from the source
code for the identification of clones and the recovery of the actually implemented
architecture of a subject software system. In particular, we investigate the
application of Kernel Methods [20,37] to define similarity measures able to exploit the
structural representation of the source code. We used these techniques in the
fields of architecture reconstruction and clone detection because they provide
flexible solutions able to analyze large data sets with reasonable computational
requirements.
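As a rough illustration of a kernel defined over a structural representation of code, the sketch below counts the syntax-tree "productions" (a node type together with the types of its children) that two fragments share. This is an assumed, simplified stand-in and not the kernel used in this paper. It relies on Python's standard `ast` module, so it only handles Python source, and the function names are hypothetical.

```python
import ast
import math
from collections import Counter


def productions(source):
    """Count (node type, child node types) pairs in the syntax tree of source."""
    tree = ast.parse(source)
    feats = Counter()
    for node in ast.walk(tree):
        children = tuple(type(c).__name__ for c in ast.iter_child_nodes(node))
        feats[(type(node).__name__, children)] += 1
    return feats


def kernel(a, b):
    """Dot product of the production-count vectors of two fragments."""
    fa, fb = productions(a), productions(b)
    return sum(count * fb[p] for p, count in fa.items())


def normalized_kernel(a, b):
    """Cosine-normalized kernel value, in [0, 1]."""
    denom = math.sqrt(kernel(a, a) * kernel(b, b))
    return kernel(a, b) / denom if denom else 0.0


print(normalized_kernel("x = a + b", "y = c + d"))
print(normalized_kernel("x = a + b", "for i in range(3): print(i)"))
```

Because the function is an explicit dot product of feature counts, it is a valid kernel and could in principle be plugged into any kernel machine. Identifier names do not appear in the features, so fragments that differ only by renaming receive the maximum score.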
Paper Structure. In Sections 2 and 3, we provide an extensive review of the state of the art
for SAR and clone detection techniques, respectively. In Section 4, we illustrate
our proposal for clone detection. Section 5 describes the case study used in the
evaluation procedure, whose results are reported in Section 6. Plans for applying
kernel machines and advanced structured-output learning approaches for SAR
are discussed in Section 7. Finally, conclusions are drawn in Section 8.
 