Applying Program Analysis to Code Retrieval - Finding Source Code on the Web for Remix and Reuse

Databases Reference

In-Depth Information

thesis. The tokenizer is sufficient preprocessing for any system that treats the source

code as plain text, as was described in the previous section. In order to extract the

structural elements, however, the tokens need to be fed in to the parser, the second

component. The parser, based on the syntax of the language, builds a tree-based

representation of the text, called an abstract syntax tree (AST). Parsers can either

be written by hand, or automatically generated based on a formal specification of

the language syntax. Given the complexity of the language syntax for many popular

languages, it is generally much easier to reuse an existing parser than develop one

from scratch. For a more complete picture of how compiler front-ends function, we

recommend Aho et al.'s Compilers: Principles, Techniques, and Tools [ 2 ].

11.5 Link-Based Retrieval

In addition to being highly structured, source code also contains a large amount

of semantic information. Given that source code is designed to be understood by

computers, source code retrieval systems can leverage this semantic information

much more easily than general text retrieval systems. In this section, we will discuss

how links, a specific form of semantic information present in source code, can be

used to improve source code retrieval.

The concept of link analysis first came to prominence with the internet, where

Google showed that it could be used to great advantage with its PageRank algorithm.

In the case of web pages, links are the unidirectional hyperlinks where a webpage

directs its readers to another webpage. The insight behind PageRank was that the

number of incoming links to a webpage could (iteratively) be used as proxy measure

of that webpage's quality.

In the domain of software, links are slightly different. Rather than being an ex-

plicit hyperlink, a link instead manifests as a name, which is reference to some type

or method defined elsewhere in the code. For example, if a method creates a local

variable with type ByteArrayOutputStream , then there is a semantic link be-

tween that local variable and the declaration of ByteArrayOutputStream .One

common form of link is the method call. When method calls are connected together,

one gets a call-graph, a commonly used link structure in programming language

analysis.

Link information can be used in a number of different ways to improve source

code retrieval. One popular approach is to adapt PageRank to source code retrieval.

This can be found in systems such as Sourcerer [ 9 ], Portfolio [ 11 ]andSpars-J[ 16 ].

Another approach is to use link information to improve the presentation of results

or give users more detailed information on retrieved code. CodeGenie [ 8 ] and Code

Conjurer [ 7 ], for example, both use link information to extract executable snippets

of code for reuse, a technique which will be discussed in the next section. SpotWeb

[ 14 ], by contrast, counts links in order to give developers an idea of how popular

individual methods in a library are.

Search WWH ::

Custom Search

Home