Databases Reference
In-Depth Information
Another technique that uses link information is Structural Semantic Indexing
(SSI) [ 3 ]. SSI uses link information to address the paucity of descriptive words
present in source code. The key idea is the following: code entities that share com-
mon usage of APIs are functionally similar, and can share the terms used to define
each other. Simply put, if both A and B use a common set of APIs in a similar man-
ner, then A and B are semantically related even if their names are different. As such,
a query for A should retrieve B and vice-versa.
11.5.1 Link Extraction
The first step in performing any sort of link analysis is to build the appropriate form
of link graph. In the context of source code retrieval, a link graph contains a set of
nodes, representing structural elements, and a set of edges, representing links. For
example, to represent a call graph, the nodes would represent methods and the edges
would represent method calls.
Link extraction proceeds in a similar manner to the structured term extraction
described in Sect. 11.4.1 . In term extraction, the intermediate form generated by
the compiler front-end is traversed and the terms output along with their associated
structural elements. In link extraction, further analysis is performed on the interme-
diate form in order to identify the desired links. For example, to build a call-graph,
one must process the intermediate form to identify the method being called at every
method call site. The result is called an attributed abstract syntax tree, as the abstract
syntax tree has been attributed with type and reference information.
The main difficulty in building an attributed abstract syntax tree, and hence in
identify links in source code, is that the links are often ambiguous, their exact refer-
ent determinable only when the program is executed.
There are two main forms of ambiguity. First, there is the ambiguity inherent
to static analysis. For example, due to virtual method binding, it is uncomputable
what methods actually get called from a given call site. The standard solution to this
issue is to use conservative approximations. Instead of including only those methods
that actually get called, all methods that might get called are included. In addition
to virtual binding issues, static analysis is unable to consistently handle the use of
reflection. If the name of the method to be called is not available until runtime, then
it cannot be discovered statistically. This form of ambiguity is present in all static
analysis.
The second form of ambiguity comes from nature of the data used in source code
retrieval systems. For static analysis to function properly, it requires a declaratively
complete program, where every name in the program has a corresponding known
declaration. If one attempts to perform static analysis with missing declarations,
most systems simply fail on reaching an unknown name.
Unfortunately, it is not acceptable to limit one's corpus of source code to only
declaratively complete programs, as empirical research has shown that approxi-
mately two-thirds of open source programs are not declaratively complete [ 13 ].
Search WWH ::




Custom Search