importance; structure is another. The name of a class, for example, is likely a more
important term than one randomly selected from the body of a method. Looking
at our example snippet, the method name, readToByteArray, captures the function
of the method much better than the keyword null or the parameter name is. If
a developer searches for is null, this method is likely a poor match, despite
containing both terms, especially compared to a method named isNull. To achieve
this, the retrieval system must give priority to matches where the terms appear in
the method name.
The syntax of the programming language determines the exact set of relevant
structural elements, and so code retrieval systems must either be language-specific
or use a model that captures common elements across multiple languages. Structural
elements can include features like the file names, method names, the contents of
comments, and the bodies of methods.
In order to take structure into account, every instance of every term is annotated
with the structural element from which it came. The ranking system can then weight
a term according to its origin. Thus a term found in a method name can be weighted
differently than a term found in a method body. This weighting is often a simple
linear combination, but more complicated functions can also be used. There is no
simple method for deciding on the relative weights to use in the ranking system.
Often, the weights assigned are determined by a mixture of intuition (what structural
elements should be more important) and experimentation (what actually works).
Automated training approaches can also be used, where the weights are trained on
a set of predetermined queries whose ideal results are manually specified.
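The linear weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring function of any particular system: the field names, the weight values, and the term lists for the two methods are all assumptions invented for the example.

```python
from collections import Counter

# Hypothetical weights per structural element, picked by intuition
# purely for illustration; a real system would tune these.
FIELD_WEIGHTS = {"method_name": 4.0, "parameter_name": 1.5, "method_body": 1.0}

def score(annotated_terms, query_terms, weights=FIELD_WEIGHTS):
    """Linear combination: each matching term occurrence contributes
    the weight of the structural element it came from."""
    counts = Counter(annotated_terms)  # (field, term) -> occurrences
    return sum(
        weights.get(field, 0.0) * n
        for (field, term), n in counts.items()
        if term in query_terms
    )

# Every term instance is annotated with its originating element.
# These term lists are made up for the example.
read_to_byte_array = [
    ("method_name", "read"), ("method_name", "to"),
    ("method_name", "byte"), ("method_name", "array"),
    ("parameter_name", "is"),
    ("method_body", "is"), ("method_body", "null"),
]
is_null = [
    ("method_name", "is"), ("method_name", "null"),
    ("parameter_name", "value"),
    ("method_body", "value"), ("method_body", "null"),
]

query = {"is", "null"}
scores = {
    "readToByteArray": score(read_to_byte_array, query),
    "isNull": score(is_null, query),
}
```

With these assumed weights, both methods contain the query terms, but isNull scores higher because its matches fall in the method name rather than in the parameter list and body.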
Apache Lucene, the open-source platform mentioned in the previous section,
uses a document model that is fundamentally built around this idea of structured
text [1]. Each document in Lucene is a collection of fields, each of which contains
a collection of terms. Each structural element can be mapped directly to a Lucene
field, so the terms that originate in an element are stored in the corresponding
field. Lucene then supports numerous ways of weighting the fields
when performing the ranking.
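To make the document-as-fields idea concrete, the following Python sketch models a tiny inverted index keyed by (field, term). It is a toy stand-in and not Lucene's API; the class name FieldedIndex, the field names, and the sample terms are assumptions made for illustration.

```python
from collections import defaultdict

class FieldedIndex:
    """Toy inverted index keyed by (field, term). Hypothetical, not Lucene's API."""

    def __init__(self):
        # (field, term) -> set of ids of documents containing that term in that field
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        # `fields` maps a field name to the list of terms found in that element.
        for field, terms in fields.items():
            for term in terms:
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        # Return matching document ids in sorted order for stable output.
        return sorted(self.postings.get((field, term), set()))

index = FieldedIndex()
index.add("readToByteArray", {
    "method_name": ["read", "to", "byte", "array"],
    "method_body": ["is", "null"],
})
index.add("isNull", {
    "method_name": ["is", "null"],
    "method_body": ["value", "null"],
})

name_hits = index.search("method_name", "null")
body_hits = index.search("method_body", "null")
```

Because each posting records the field a term came from, a query can be restricted to one structural element, or the per-field hits can be combined afterwards with different weights.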
11.4.1 Term Extraction
A basic form of static analysis is required for associating terms with structural
elements. Such a term extraction system must be aware of the programming
language syntax, and bears many similarities to the front-end of a compiler; it must
take plain text that conforms to the language specification and convert it to an
intermediate form. In a compiler, this intermediate form is then optimized and lowered
to the output language. In term extraction, this intermediate form is traversed and
terms are output with their associated elements.
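As a rough sketch of this parse-then-traverse process, the example below uses Python's standard ast module on a small piece of Python source. This is a stand-in under stated assumptions: the surrounding discussion is not tied to Python, and the element labels (class_name, method_name, parameter_name, method_body) and the sample class are invented for illustration.

```python
import ast

def extract_terms(source):
    """Parse source into an AST (the intermediate form), then walk it
    and emit (structural_element, term) pairs."""
    tree = ast.parse(source)
    terms = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            terms.append(("class_name", node.name))
        elif isinstance(node, ast.FunctionDef):
            terms.append(("method_name", node.name))
            for arg in node.args.args:
                terms.append(("parameter_name", arg.arg))
        elif isinstance(node, ast.Name):
            # Simplification: every identifier reference counts as a body term.
            terms.append(("method_body", node.id))
        elif isinstance(node, ast.Attribute):
            # Attribute accesses such as stream.read contribute the name "read".
            terms.append(("method_body", node.attr))
    return terms

# Hypothetical sample input, written only for this example.
SAMPLE = '''
class StreamUtil:
    def read_to_byte_array(self, stream):
        data = stream.read()
        return data
'''

terms = extract_terms(SAMPLE)
```

A fuller extractor would also split compound identifiers into individual words and collect comment text, which the ast module discards; both are left out here to keep the sketch short.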
There are two primary components to any compiler front-end. First, there is the
tokenizer, which breaks the original text into tokens. The tokenizer typically splits
the original text on white space plus some special characters, like braces and paren-