Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability - Trustworthy Eternal Systems via Evolving, Software Data and Knowledge

source code represent a key source of information. In particular, such techniques mine relevant information from source code identifiers and comments based on the assumption that related artifacts are those that share the same vocabulary.
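
The shared-vocabulary assumption can be illustrated with a minimal sketch. It is not taken from any of the surveyed approaches: the function names (`split_identifier`, `vocabulary`, `cosine`) and the tokenization rules are assumptions made for illustration, and plain term-frequency vectors stand in for the more elaborate models discussed below.

```python
import math
import re
from collections import Counter


def split_identifier(token):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", token)
    return [p.lower() for p in parts]


def vocabulary(source_text):
    """Count the words found in identifiers and comments of a code fragment."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_text)
    words = []
    for token in tokens:
        words.extend(split_identifier(token))
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical code fragments, used only to exercise the functions.
account = vocabulary("int accountBalance; // update the account balance")
update = vocabulary("void update_account(Account a)")
render = vocabulary("render pixel buffer")
print(cosine(account, update) > cosine(account, render))
```

Under this assumption, the two account-related fragments share words and obtain a positive similarity, while the rendering fragment shares none with them.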

The approach proposed by Kuhn et al. [29] is one of the first proposals in this direction. It defines an automatic technique based on the Latent Semantic Indexing (LSI) method [11]. The approach is language independent and mines the lexical information gathered from source code comments. In addition, it enables software engineers to identify topics in the source code by labeling the identified clusters.

Similarly, Risi et al. [38] propose an approach that uses LSI and the k-means clustering algorithm to form groups of software entities that implement similar functionality. A variant based on fold-in and fold-out is introduced as well. Furthermore, this proposal provides an important contribution on the analysis of the computational costs necessary to assess the validity of a clustering process.
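
The clustering step of such a pipeline can be sketched as follows. This is a plain k-means loop, not the procedure of Risi et al.: the LSI projection is assumed to have been applied already, the input vectors are hypothetical two-dimensional coordinates, and the initial centroids are passed explicitly so that the run is deterministic. The fold-in and fold-out variant is not covered.

```python
import math


def kmeans(vectors, centroids, iterations=20):
    """Plain k-means: assign each vector to its nearest centroid, then update."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every vector.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids


# Hypothetical entities already projected into a 2-dimensional concept space.
entities = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centers = kmeans(entities, centroids=[entities[0], entities[2]])
print(labels)
```

Each entity ends up with the label of the group whose centroid is closest to it in the reduced space.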

Corazza et al. [8] propose a clustering-based approach that treats the source code text as structured into different zones, each providing information of different relevance. In particular, the relevance of each zone is weighted automatically by defining a probabilistic generative model and applying the Expectation-Maximization (EM) algorithm. Related artifacts are then grouped accordingly using a customization of the k-medoids clustering algorithm. More recently, the same authors investigated the effectiveness of the EM algorithm in combination with different code zones [7] and different clustering algorithms [10].

Structural and Lexical Based Approaches: Maletic and Marcus [33] propose an approach that combines lexical and structural information to support comprehension tasks during the maintenance and reengineering of software systems. From the lexical point of view, they consider the problem and development domains. The structural dimension, on the other hand, refers to the actual syntactic structure of the program along with the control and data flow that it represents. Software entities are compared using LSI, while file organization is used to get structural information. To group programs into clusters, a simple graph-theoretic algorithm is used. The algorithm takes as input an undirected graph (obtained by computing the cosine similarity between the vector representations of each pair of source code documents) and then constructs a Minimal Spanning Tree (MST). Clusters are identified by pruning the edges of the MST whose weight is larger than a given threshold. To assess the effectiveness of the approach, some case studies on a version of Mosaic are presented and discussed.
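
The MST-based grouping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the edge weights are distances (for example, 1 minus the cosine similarity), so that pruning heavy edges separates dissimilar documents, and it uses Kruskal's algorithm with a small union-find helper. The names `DisjointSet` and `mst_clusters` and the sample matrix are hypothetical.

```python
class DisjointSet:
    """Union-find structure used both to build the MST and to read clusters."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False
        self.parent[root_a] = root_b
        return True


def mst_clusters(dist, threshold):
    """Build an MST over a distance matrix, drop edges above threshold,
    and return the remaining connected components as clusters."""
    n = len(dist)
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    forest = DisjointSet(n)
    # Kruskal: keep an edge only if it joins two different trees.
    mst = [(w, i, j) for w, i, j in edges if forest.union(i, j)]
    clusters = DisjointSet(n)
    for w, i, j in mst:
        if w <= threshold:  # edges heavier than the threshold are pruned
            clusters.union(i, j)
    groups = {}
    for node in range(n):
        groups.setdefault(clusters.find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical pairwise distances between four source code documents.
distances = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
print(mst_clusters(distances, threshold=0.5))
```

The threshold controls the granularity: a larger value keeps more MST edges and yields fewer, larger clusters.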

Scanniello et al. [42] present a two-phase approach for recovering hierarchical software architectures of object-oriented software systems. The first phase uses structural information to identify software layers [41]. To this end, a customization of the Kleinberg algorithm [24] is used. The second phase uses lexical information extracted from the source code to identify similarity among pairs of classes and then partitions each identified layer into software modules. The main limitation of this approach is that it is only suitable for software systems exhibiting a classical tiered architecture.
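
For readers unfamiliar with the Kleinberg algorithm, the sketch below shows its standard hub and authority iteration on a small dependency graph. It is the plain algorithm, not the customization used by Scanniello et al., and it does not show how scores are mapped to layers. The class names and the graph are hypothetical.

```python
import math


def hits(graph, iterations=50):
    """Standard hub/authority iteration on a directed graph given as
    {node: [nodes it depends on]}."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the nodes pointing to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub: sum of the authority scores of the nodes it points to.
        hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        # Normalize both score vectors to unit Euclidean length.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth


# Hypothetical class dependencies in a tiered system.
dependencies = {
    "ui": ["service"],
    "controller": ["service"],
    "service": ["dao"],
    "dao": [],
}
hub, auth = hits(dependencies)
print(max(auth, key=auth.get))
```

In this toy graph, classes that only use others receive high hub scores, while the class used by most others receives the highest authority score.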
