Applying Program Analysis to Code Retrieval - Finding Source Code on the Web for Remix and Reuse

Databases Reference

In-Depth Information

Another technique that uses link information is Structural Semantic Indexing

(SSI) [ 3 ]. SSI uses link information to address the paucity of descriptive words

present in source code. The key idea is the following: code entities that share com-

mon usage of APIs are functionally similar, and can share the terms used to define

each other. Simply put, if both A and B use a common set of APIs in a similar man-

ner, then A and B are semantically related even if their names are different. As such,

a query for A should retrieve B and vice-versa.

11.5.1 Link Extraction

The first step in performing any sort of link analysis is to build the appropriate form

of link graph. In the context of source code retrieval, a link graph contains a set of

nodes, representing structural elements, and a set of edges, representing links. For

example, to represent a call graph, the nodes would represent methods and the edges

would represent method calls.

Link extraction proceeds in a similar manner to the structured term extraction

described in Sect. 11.4.1 . In term extraction, the intermediate form generated by

the compiler front-end is traversed and the terms output along with their associated

structural elements. In link extraction, further analysis is performed on the interme-

diate form in order to identify the desired links. For example, to build a call-graph,

one must process the intermediate form to identify the method being called at every

method call site. The result is called an attributed abstract syntax tree, as the abstract

syntax tree has been attributed with type and reference information.

The main difficulty in building an attributed abstract syntax tree, and hence in

identify links in source code, is that the links are often ambiguous, their exact refer-

ent determinable only when the program is executed.

There are two main forms of ambiguity. First, there is the ambiguity inherent

to static analysis. For example, due to virtual method binding, it is uncomputable

what methods actually get called from a given call site. The standard solution to this

issue is to use conservative approximations. Instead of including only those methods

that actually get called, all methods that might get called are included. In addition

to virtual binding issues, static analysis is unable to consistently handle the use of

reflection. If the name of the method to be called is not available until runtime, then

it cannot be discovered statistically. This form of ambiguity is present in all static

analysis.

The second form of ambiguity comes from nature of the data used in source code

retrieval systems. For static analysis to function properly, it requires a declaratively

complete program, where every name in the program has a corresponding known

declaration. If one attempts to perform static analysis with missing declarations,

most systems simply fail on reaching an unknown name.

Unfortunately, it is not acceptable to limit one's corpus of source code to only

declaratively complete programs, as empirical research has shown that approxi-

mately two-thirds of open source programs are not declaratively complete [ 13 ].

Search WWH ::

Custom Search

Home