source code represent a key source of information. In particular, such techniques
mine relevant information from source code identifiers and comments based on
the assumption that related artifacts are those that share the same vocabulary.
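The shared-vocabulary assumption can be illustrated with a minimal Python
sketch. It splits identifiers and comment words into lowercase terms and
compares two artifacts by the cosine of their term-frequency vectors. The
function names (split_identifier, vocabulary, cosine_similarity) and the toy
snippets are hypothetical and are not taken from any of the cited approaches;
a real pipeline would also filter language keywords and stop words.

```python
import math
import re
from collections import Counter

def split_identifier(name):
    """Split 'parseHttpRequest' or 'max_retry_count' into lowercase words."""
    words = []
    for part in re.split(r"[^A-Za-z]+", name):
        # Break camelCase and keep acronym runs ('HTTPServer' -> 'HTTP', 'Server').
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part)
    return [w.lower() for w in words]

def vocabulary(source_text):
    """Bag of words built from every identifier-like token, comments included."""
    bag = Counter()
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_text):
        bag.update(split_identifier(token))
    return bag

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency bags."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy artifacts (assumed for illustration only).
parser = vocabulary("def parseRequest(raw_request): # parse the raw request header")
reader = vocabulary("def read_request_header(socket): # read header from socket")
shapes = vocabulary("def drawCircle(radius): # draw a circle")

# The two request-handling snippets share more vocabulary than parser and shapes.
print(cosine_similarity(parser, reader) > cosine_similarity(parser, shapes))
```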
The approach proposed by Kuhn et al. [29] is one of the first proposals in
this direction. It defines an automatic technique based on the Latent Semantic
Indexing (LSI) method [11]. The approach is language-independent and mines the
lexical information gathered from source code comments. In addition, it enables
software engineers to identify topics in the source code by labeling the
identified clusters.
Similarly, Risi et al. [38] propose an approach that uses LSI and the k-means
clustering algorithm to form groups of software entities that implement
similar functionality. A variant based on fold-in and fold-out is introduced as
well. Furthermore, this proposal contributes an analysis of the computational
costs necessary to assess the validity of a clustering process.
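The grouping step can be sketched with the standard k-means (Lloyd) iteration.
This is a minimal illustration and not the implementation of Risi et al.: the
2-D points stand in for the LSI vectors of software entities, and the starting
centroids are passed in explicitly so that the run is deterministic. The
function name kmeans and the toy data are assumptions.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's k-means: returns (labels, centroids) for the given starting centroids."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move either
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy vectors (assumed): three entities near (1, 0), two near (0, 1).
points = [(0.9, 0.1), (0.8, 0.2), (1.0, 0.0), (0.1, 0.9), (0.2, 0.8)]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

k-means needs the number of clusters up front and is sensitive to the initial
centroids, which is one reason the cost of validating a clustering matters.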
Corazza et al. [8] propose a clustering-based approach that considers the
source code text as structured in different zones providing different relevance of
information. In particular, the relevance of each zone is automatically weighted
thanks to the definition of a probabilistic generative model and the application
of the Expectation-Maximization (EM) algorithm. Related artifacts are then
grouped accordingly using a customization of the k-medoids clustering algorithm.
More recently, the same authors investigated the effectiveness of the EM
algorithm in combination with different code zones [7] and different
clustering algorithms [10].
Structural and Lexical Based Approaches: Maletic and Marcus [33]
propose an approach based on the combination of lexical and structural informa-
tion to support comprehension tasks within the maintenance and reengineering
of software systems. From the lexical point of view they consider problem and
development domains. On the other hand, the structural dimension refers to the
actual syntactic structure of the program along with the control and dataflow
that it represents. Software entities are compared using LSI, while file
organization is used to obtain structural information. To group programs into
clusters, a simple graph-theoretic algorithm is used. The algorithm takes as
input an undirected graph (obtained by computing the cosine similarity between
the vector representations of each pair of source code documents) and then
constructs a Minimal Spanning Tree (MST). Clusters are identified by pruning
the MST edges whose weight is larger than a given threshold. To assess the
effectiveness of the approach, some case studies on a version of Mosaic are
presented and discussed.
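The MST-based grouping can be sketched as follows. This is an illustration
under stated assumptions, not the authors' implementation: edge weights are
taken to be distances (for example 1 minus the cosine similarity), so that
pruning heavy edges separates dissimilar documents; the MST is built with
Kruskal's algorithm; and the function name mst_clusters and the toy matrix are
hypothetical.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def mst_clusters(dist, threshold):
    """Cluster items 0..n-1 by pruning heavy edges from a minimum spanning tree.

    dist is a symmetric matrix of edge weights (assumed: 1 - cosine similarity).
    """
    n = len(dist)
    # Kruskal: scan edges by increasing weight, keep those joining two trees.
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))
    mst = []
    for w, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))
    # Prune MST edges heavier than the threshold; surviving edges define clusters.
    parent = list(range(n))
    for w, i, j in mst:
        if w <= threshold:
            parent[find(parent, i)] = find(parent, j)
    groups = {}
    for x in range(n):
        groups.setdefault(find(parent, x), []).append(x)
    return sorted(groups.values())

# Toy distance matrix (assumed): documents 0-2 are close, as are documents 3-4.
dist = [
    [0.00, 0.10, 0.20, 0.90, 0.80],
    [0.10, 0.00, 0.15, 0.85, 0.90],
    [0.20, 0.15, 0.00, 0.70, 0.95],
    [0.90, 0.85, 0.70, 0.00, 0.05],
    [0.80, 0.90, 0.95, 0.05, 0.00],
]
print(mst_clusters(dist, threshold=0.5))
```

Unlike k-means, this procedure does not need the number of clusters in
advance; the threshold alone decides how many groups emerge.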
Scanniello et al. [42] present a two-phase approach for recovering hierarchical
software architectures of object-oriented software systems. The first phase
uses structural information to identify software layers [41]. To this end, a
customization of the Kleinberg algorithm [24] is used. The second phase uses lexical
information extracted from the source code to identify similarity among pairs
of classes and then partitions each identified layer into software modules. The
main limitation of this approach is that it is only suitable for software systems
exhibiting a classical tiered architecture.
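For orientation, the plain (uncustomized) Kleinberg hub and authority
iteration can be sketched on a class dependency graph. The customization used
for layer identification is not described here, so the sketch below shows only
the standard algorithm; the edge list, the class names, and the reading that
high-hub classes tend to sit in upper layers while high-authority classes sit
in lower ones are assumptions made for illustration.

```python
import math

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed graph of (src, dst) pairs."""
    nodes = sorted({n for e in edges for n in e})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the nodes pointing at it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub: sum of the authority scores of the nodes it points to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalise both score vectors so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical "uses" relation between classes of a small tiered system.
uses = [
    ("View", "Service"),
    ("Controller", "Service"),
    ("Service", "Repository"),
    ("Controller", "Repository"),
]
hub, auth = hits(uses)
print(max(hub, key=hub.get))  # the class that depends on the most-used classes
```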
 