them by following predefined probability distributions. In this way, the quality
of the training set can be controlled without imposing any restriction on its
size. A kernel-based classifier is then trained on this data set.
To this aim, we designed and implemented an algorithm that injects clones into
the source code. This algorithm allows us to generate a training set
automatically and to apply a more reliable strategy in the definition of the
supervised kernel learning process. The core of our clone injection algorithm
is the function InjectClone, whose pseudocode is reported in Algorithm 1. This
algorithm generates function clones and tracks their location in the source
code, thus producing a labeled dataset of clones of the given input Type.
In more detail, the algorithm starts by parsing the stream of source code of
the analyzed software system in order to extract all the target functions
(Line 2). Each function is then processed in turn, deciding whether or not it
has to be cloned (Line 6) and how many clones should be generated (Line 9). In
particular, each function has a probability probCloning of being cloned.
Moreover, if a function has to be cloned, the number of clones to generate is
chosen randomly according to a geometric probability distribution with
parameter 0.5, namely $\Pr(nCopies = n) = 0.5^{n}$ for n = 1, 2, ...
(Lines 9-11). Finally, the algorithm invokes the procedures Clone and Inject
to perform, respectively, the generation of the clones and their injection
into the source code, and returns the tracking information for the generated
data.
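The control flow described above can be sketched in Python. This is only an illustrative sketch, not the authors' Algorithm 1: the names inject_clones, sample_n_copies and make_clone are hypothetical, the functions are assumed to be already extracted as (name, text) pairs, and the Inject step is replaced by a simple append because its placement policy is not described here.

```python
import random


def sample_n_copies(rng):
    """Draw n >= 1 with Pr(n) = 0.5 ** n (geometric, parameter 0.5)."""
    n = 1
    # Each extra copy is added with probability 0.5.
    while rng.random() < 0.5:
        n += 1
    return n


def inject_clones(functions, prob_cloning, clone_type, rng, make_clone):
    """Return (units, tracking) for a list of (name, text) function pairs.

    units    -- the original function texts plus the injected clones
    tracking -- (original_name, index_in_units, clone_type) for each clone
    make_clone is a hypothetical stand-in for the Clone procedure.
    """
    units = []
    tracking = []
    for name, text in functions:
        units.append(text)
        # Each function is cloned with probability prob_cloning.
        if rng.random() >= prob_cloning:
            continue
        for _ in range(sample_n_copies(rng)):
            clone = make_clone(text, clone_type)
            # Assumption: the clone is appended; the real Inject step
            # may choose a different location in the source code.
            tracking.append((name, len(units), clone_type))
            units.append(clone)
    return units, tracking
```

Passing a seeded random.Random instance as rng makes the generated training set reproducible, and the tracking list supplies the labels for the supervised learning step.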
The pseudocode of the Clone procedure is reported in Algorithm 2.
The Clone procedure generates clones up to Type 4 by employing a set of
procedures, each of which applies a specific modification (mutation) to the
program text of the target function. These procedures are invoked according
to the Type of the clone to generate. Their pseudocode is not reported here
due to space limitations.
The first mutation operation is performed by the CopyAndChangeLayout function
(Line 2), which is always applied to the target function, regardless of the
selected clone Type. This is because all four definitions of clones allow
some modification in the layout of the program text. The substitution of
identifiers and literals is performed for clones of Type 2 up to Type 4, by
invoking the SubstituteIdsAndLiterals procedure (Line 4). In particular, this
procedure processes every literal and identifier of the input function, each
of which has a probability probSubstituteId of being substituted with a
randomly generated identifier.
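The first two mutation steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the keyword set are hypothetical, a regular expression stands in for a real lexer (so identifiers inside string literals would also be renamed), literals are left untouched, and each distinct identifier is renamed consistently across its occurrences, which is an assumption made here.

```python
import random
import re

# Tiny illustrative keyword set; a real tool would use the lexer of the
# target language instead.
KEYWORDS = {"int", "char", "float", "double", "void",
            "return", "if", "else", "for", "while"}
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*\b")


def copy_and_change_layout(text, rng):
    """Return a copy whose lines are re-indented by a random width."""
    indent = " " * rng.randint(0, 8)
    return "\n".join(indent + line.strip() for line in text.splitlines())


def substitute_ids(text, prob_substitute_id, rng):
    """Rename each distinct non-keyword identifier with the given probability."""
    used = set(IDENTIFIER.findall(text))
    renames = {}
    # Sorting keeps the result reproducible for a fixed seed.
    for name in sorted(used - KEYWORDS):
        if rng.random() < prob_substitute_id:
            fresh = "id_%06x" % rng.getrandbits(24)
            while fresh in used:  # avoid clashing with existing names
                fresh = "id_%06x" % rng.getrandbits(24)
            used.add(fresh)
            renames[name] = fresh
    return IDENTIFIER.sub(lambda m: renames.get(m.group(0), m.group(0)), text)


def make_clone(text, clone_type, prob_substitute_id, rng):
    """Apply the layout change always, and renaming for Type 2 and above."""
    clone = copy_and_change_layout(text, rng)
    if clone_type >= 2:
        clone = substitute_ids(clone, prob_substitute_id, rng)
    return clone
```

The sketch covers only the layout and renaming mutations; the additional mutations specific to Type 3 and Type 4 clones are not implemented.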
When dealing with Type 3 clones, further operations must be considered in
addition to the mutations applied for Type 2. Indeed, in a Type 3 clone, two
fragments of code may also differ in their statements, which may be added or
removed (Line 7). Therefore, we assigned the same probability (i.e., 1/2 for
each operation) to the insertion of a new statement, randomly extracted from
the considered software system, and to the deletion of a statement.
Furthermore, we impose an upper bound on the total number of operations,
which is a randomly
 