Databases Reference
In-Depth Information
thesis. The tokenizer is sufficient preprocessing for any system that treats the source
code as plain text, as was described in the previous section. In order to extract the
structural elements, however, the tokens need to be fed in to the parser, the second
component. The parser, based on the syntax of the language, builds a tree-based
representation of the text, called an abstract syntax tree (AST). Parsers can either
be written by hand, or automatically generated based on a formal specification of
the language syntax. Given the complexity of the language syntax for many popular
languages, it is generally much easier to reuse an existing parser than develop one
from scratch. For a more complete picture of how compiler front-ends function, we
recommend Aho et al.'s Compilers: Principles, Techniques, and Tools [ 2 ].
11.5 Link-Based Retrieval
In addition to being highly structured, source code also contains a large amount
of semantic information. Given that source code is designed to be understood by
computers, source code retrieval systems can leverage this semantic information
much more easily than general text retrieval systems. In this section, we will discuss
how links, a specific form of semantic information present in source code, can be
used to improve source code retrieval.
The concept of link analysis first came to prominence with the internet, where
Google showed that it could be used to great advantage with its PageRank algorithm.
In the case of web pages, links are the unidirectional hyperlinks where a webpage
directs its readers to another webpage. The insight behind PageRank was that the
number of incoming links to a webpage could (iteratively) be used as proxy measure
of that webpage's quality.
In the domain of software, links are slightly different. Rather than being an ex-
plicit hyperlink, a link instead manifests as a name, which is reference to some type
or method defined elsewhere in the code. For example, if a method creates a local
variable with type ByteArrayOutputStream , then there is a semantic link be-
tween that local variable and the declaration of ByteArrayOutputStream .One
common form of link is the method call. When method calls are connected together,
one gets a call-graph, a commonly used link structure in programming language
analysis.
Link information can be used in a number of different ways to improve source
code retrieval. One popular approach is to adapt PageRank to source code retrieval.
This can be found in systems such as Sourcerer [ 9 ], Portfolio [ 11 ]andSpars-J[ 16 ].
Another approach is to use link information to improve the presentation of results
or give users more detailed information on retrieved code. CodeGenie [ 8 ] and Code
Conjurer [ 7 ], for example, both use link information to extract executable snippets
of code for reuse, a technique which will be discussed in the next section. SpotWeb
[ 14 ], by contrast, counts links in order to give developers an idea of how popular
individual methods in a library are.
Search WWH ::




Custom Search