Applying Program Analysis to Code Retrieval - Finding Source Code on the Web for Remix and Reuse

importance; structure is another. The name of a class, for example, is likely a more

important term than one randomly selected from the body of a method. Looking

at our example snippet, the method name, readToByteArray, captures the function

of the method much better than the keyword null or the parameter name is. If

a developer searches for "is null", this method is likely a poor match, despite

containing both terms, especially compared to a method named isNull. To rank

results this way, the retrieval system must give priority to matches where the terms

appear in the method name.

The syntax of the programming language determines the exact set of relevant

structural elements, and so code retrieval systems must either be language-specific

or use a model that captures common elements across multiple languages. Structural

elements can include features like file names, method names, the contents of

comments, and the bodies of methods.

In order to take structure into account, every instance of every term is annotated

with the structural element from which it came. The ranking system can then weight

a term according to its origin. Thus a term found in a method name can be weighted

differently than a term found in a method body. This weighting is often a simple

linear combination, but more complicated functions can also be used. There is no

simple method for deciding on the relative weights to use in the ranking system.

Often, the weights assigned are determined by a mixture of intuition (what structural

elements should be more important) and experimentation (what actually works).

Automated training approaches can also be used, where the weights are trained on

a set of predetermined queries whose ideal results are manually specified.
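
The linear weighting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (method_name, parameter_name, method_body), the weight values, and the score function are all hypothetical, and the terms are assumed to have been split and lowercased already.

```python
from collections import Counter

# Hypothetical weights per structural element, chosen for illustration only.
FIELD_WEIGHTS = {"method_name": 4.0, "parameter_name": 2.0, "method_body": 1.0}


def score(query_terms, annotated_terms, weights=FIELD_WEIGHTS):
    """Score one code entity as a weighted sum of per-element term matches.

    annotated_terms is a list of (term, structural_element) pairs.
    """
    counts = Counter(annotated_terms)
    total = 0.0
    for term in query_terms:
        for element, weight in weights.items():
            # Counter returns 0 for pairs that never occurred.
            total += weight * counts[(term, element)]
    return total


# Annotated terms for the two methods discussed in the text.
read_to_byte_array = [
    ("read", "method_name"), ("to", "method_name"),
    ("byte", "method_name"), ("array", "method_name"),
    ("is", "parameter_name"), ("null", "method_body"),
]
is_null = [("is", "method_name"), ("null", "method_name")]

query = ["is", "null"]
print(score(query, read_to_byte_array))  # 3.0
print(score(query, is_null))             # 8.0
```

With these assumed weights, the query "is null" scores isNull well above readToByteArray, even though both contain both terms. Changing the values in FIELD_WEIGHTS is the point at which intuition, experimentation, or automated training would enter.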

Apache Lucene, the open-source platform mentioned in the previous section,

uses a document model that is fundamentally built around this idea of structured

text [1]. Each document in Lucene is a collection of fields, each of which contains

a collection of terms. Each structural element can be directly mapped to a Lucene

field, and so using Lucene each structural element can be associated with the terms

that originate there. Lucene then supports numerous ways of weighting the fields

when performing the ranking.
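
To make the document model concrete, the following Python sketch mimics the idea of a document as a collection of fields, each holding terms. It is not Lucene's API: the FieldedIndex class, its methods, and the boost values are hypothetical stand-ins, and the scoring is a plain boosted term frequency rather than anything Lucene actually computes.

```python
from collections import defaultdict


class FieldedIndex:
    """Toy inverted index keyed by (field, term); not Lucene's API."""

    def __init__(self):
        # (field, term) -> {doc_id: term frequency in that field}
        self.postings = defaultdict(lambda: defaultdict(int))

    def add(self, doc_id, fields):
        # fields maps a structural element name to the terms found there.
        for field, terms in fields.items():
            for term in terms:
                self.postings[(field, term)][doc_id] += 1

    def search(self, query_terms, boosts):
        # Sum boost * term frequency over every matching (field, term) pair.
        scores = defaultdict(float)
        for field, boost in boosts.items():
            for term in query_terms:
                matches = self.postings.get((field, term), {})
                for doc_id, freq in matches.items():
                    scores[doc_id] += boost * freq
        # Highest score first; ties broken by document id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = FieldedIndex()
index.add("readToByteArray", {
    "method_name": ["read", "to", "byte", "array"],
    "method_body": ["is", "null"],
})
index.add("isNull", {
    "method_name": ["is", "null"],
    "method_body": ["value", "null"],
})

results = index.search(["is", "null"], {"method_name": 4.0, "method_body": 1.0})
print(results)  # [('isNull', 9.0), ('readToByteArray', 2.0)]
```

Keeping the field as part of the index key is what lets the weighting be decided at query time, so the same index can be searched with different boosts without re-indexing.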

11.4.1 Term Extraction

A basic form of static analysis is required for associating terms with structural

elements. Such a term extraction system must be aware of the programming

language syntax, and bears many similarities to the front-end of a compiler; it must

take plain text that conforms to the language specification and convert it to an

intermediate form. In a compiler, this intermediate form is then optimized and lowered

to the output language. In term extraction, this intermediate form is traversed and

terms are output with their associated elements.
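
As a rough sketch of this parse-then-traverse process, the example below uses Python's standard ast module as the front-end and walks the resulting tree to emit (term, element) pairs. Several things here are assumptions made for illustration: the input is Python source rather than the Java of the running example, the element labels are hypothetical, docstrings stand in for comments (the ast module discards ordinary comments), and every variable reference is labeled method_body regardless of where it occurs.

```python
import ast
import re


def split_identifier(name):
    """Split camelCase and snake_case identifiers into lowercase terms."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def extract_terms(source):
    """Return (term, structural_element) pairs for Python source text."""
    pairs = []
    tree = ast.parse(source)  # plain text -> intermediate form
    for node in ast.walk(tree):  # traverse the intermediate form
        if isinstance(node, ast.ClassDef):
            pairs += [(t, "class_name") for t in split_identifier(node.name)]
        elif isinstance(node, ast.FunctionDef):
            pairs += [(t, "method_name") for t in split_identifier(node.name)]
            for arg in node.args.args:
                pairs += [(t, "parameter_name")
                          for t in split_identifier(arg.arg)]
            doc = ast.get_docstring(node)
            if doc:
                pairs += [(t.lower(), "comment")
                          for t in re.findall(r"[A-Za-z]+", doc)]
        elif isinstance(node, ast.Name):
            # Simplification: any variable reference counts as body text.
            pairs += [(t, "method_body") for t in split_identifier(node.id)]
    return pairs


SOURCE = '''
def read_to_byte_array(stream):
    """Read the stream fully."""
    if stream is None:
        return None
    return stream.read()
'''

pairs = extract_terms(SOURCE)
for term, element in pairs:
    print(term, element)
```

The same term, stream, is emitted three times here with three different labels (parameter name, comment, and body), which is exactly the annotation the ranking system needs in order to weight each occurrence by its origin.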

There are two primary components to any compiler front-end. First, there is the

tokenizer, which breaks the original text into tokens. The tokenizer typically splits

the original text on white space plus some special characters, like braces and paren-
