Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability - Trustworthy Eternal Systems via Evolving, Software Data and Knowledge

them by following predefined probability distributions. In this way, the quality

of the training set can be controlled without any need of imposing restrictions

on its size. A Kernel-based classifier is then trained on this data set.

To this aim, we designed and implemented an algorithm able to inject clones

in the source code. In particular, this algorithm allows us to automatically

generate a training set and to apply a more reliable strategy in the definition

of the supervised Kernel learning process. The core of our clone injection

algorithm is the function InjectClone, whose Pseudocode is

reported in Algorithm 1. This algorithm is able to generate function clones and

to track their location in the source code, thus obtaining a labeled dataset of

clones of the given input Type.

In more detail, the algorithm starts its computation by parsing the stream

of source code of the analyzed software system in order to extract all the target

functions (Line 2). Afterwards each function is processed one at a time, deciding

whether or not it has to be cloned (Line 6) and how many clones should be

generated (Line 9). In particular, we consider that each function has a probability

probCloning of being cloned. Moreover, if a function has to be cloned, the number

of clones to generate is randomly chosen according to a geometric probability

distribution with parameter 0.5, namely $\Pr(\mathit{nCopies} = n) = 0.5^{n}$ (Lines 9-11).

Finally, the algorithm invokes the procedures Clone and Inject to perform

the generation and the injection of clones in the source code respectively, and

returns the tracking information of the generated data.
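
The steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the names inject_clones and sample_n_copies, the representation of the parsed functions as a list of strings, the clone callback standing in for the Clone procedure, and the choice of appending each clone at the end of the list (in place of the Inject procedure, whose placement policy is not described here) are all hypothetical.

```python
import random
from typing import Callable, List, Tuple


def sample_n_copies(rng: random.Random) -> int:
    """Draw n >= 1 with Pr(n) = 0.5 ** n (geometric, parameter 0.5)."""
    n = 1
    # Each further copy is added with probability 0.5.
    while rng.random() >= 0.5:
        n += 1
    return n


def inject_clones(
    functions: List[str],
    clone_type: int,
    prob_cloning: float,
    clone: Callable[[str, int], str],
    rng: random.Random,
) -> Tuple[List[str], List[Tuple[int, int, int]]]:
    """Return the augmented function list and the tracking records.

    Each record is (original_index, clone_index, clone_type).
    Assumption: clones are appended at the end, so original indices stay valid.
    """
    output = list(functions)
    tracking: List[Tuple[int, int, int]] = []
    for original_index, source in enumerate(functions):
        # Each function is cloned with probability prob_cloning.
        if rng.random() >= prob_cloning:
            continue
        for _ in range(sample_n_copies(rng)):
            output.append(clone(source, clone_type))
            tracking.append((original_index, len(output) - 1, clone_type))
    return output, tracking
```

The tracking records are what makes the resulting dataset labeled: every pair (original_index, clone_index) is a known clone pair of the requested Type, while any pair not listed can be treated as a negative example only if the original system is assumed to be clone-free.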

The Pseudocode of the Clone procedure is reported in Algorithm 2.

The Clone procedure is able to perform the generation of clones up to Type

4 by employing a set of different procedures to apply specific modifications to

the program text (mutation) of the target function. The invocation of such procedures

is performed in accordance with the Type of the clone to generate. We

are not reporting the Pseudocode of such functions in the current document due

to space limitations.

The first mutation operation is performed by the CopyAndChangeLayout function

(Line 2), which is always applied to the target function, regardless of the selected

clone Type. This is because all four definitions of clones allow some

modification in the layout of the program text. The substitution of identifiers

and literals is performed for Type 2 clones up to Type 4 ones, by invoking the

SubstituteIdsAndLiterals procedure (Line 4). In particular, this procedure

processes every literal and identifier of the input function, each of which has a

probability probSubstituteId of being substituted with a randomly generated

identifier.
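
The first two mutation steps can be sketched in Python as follows. This is a simplified illustration, not the authors' Algorithm 2: the regular-expression tokenization, the small KEYWORDS set, the whitespace-collapsing layout change, and the function names are all assumptions made for the example. It also renames identifiers only (literals are left untouched) and decides once per distinct name, so that every occurrence of a name is renamed consistently. Type 3 and Type 4 mutations are omitted.

```python
import random
import re
import string

# Illustrative subset of C-like keywords that must never be renamed (assumption).
KEYWORDS = frozenset(
    {"int", "char", "float", "double", "void", "return", "if", "else", "for", "while"}
)
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*\b")


def copy_and_change_layout(source: str) -> str:
    """Copy the text, collapsing every run of whitespace to one space."""
    return " ".join(source.split())


def random_identifier(rng: random.Random, length: int = 8) -> str:
    """Generate a random lowercase name (collisions are unlikely but not excluded)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def substitute_ids(source: str, prob_substitute_id: float, rng: random.Random) -> str:
    """Rename each distinct identifier with probability prob_substitute_id."""
    mapping = {}

    def rename(match: "re.Match[str]") -> str:
        name = match.group(0)
        if name in KEYWORDS:
            return name
        if name not in mapping:
            substituted = rng.random() < prob_substitute_id
            mapping[name] = random_identifier(rng) if substituted else name
        return mapping[name]

    return IDENTIFIER.sub(rename, source)


def clone(source: str, clone_type: int, prob_substitute_id: float,
          rng: random.Random) -> str:
    """Apply the layout change always, and the renaming for Type 2 and above."""
    mutated = copy_and_change_layout(source)
    if clone_type >= 2:
        mutated = substitute_ids(mutated, prob_substitute_id, rng)
    return mutated
```

A text-level sketch like this would also rename words inside string literals and comments, and collapsing whitespace would break line comments and preprocessor directives; working on the tokens or syntax tree produced by a parser avoids both problems.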

When dealing with Type 3 clones, in addition to mutations applied for Type

2, further operations must be considered. Indeed, in a Type 3 clone,

two fragments of code may also differ in their statements, which can be added or

removed (Line 7). Therefore, we assigned the same probability (i.e., 1/2 for each

operation) to the insertion of a new statement randomly extracted from the

considered software system and the deletion of a statement. Furthermore, we

impose an upper bound to the total number of operations which is a randomly
