14.4 Indexing the Repository
Indexing is the process of identifying a set of keys for looking up a document in a
repository. With text documents, it is common to index all of the terms, excluding
stop words. Choosing what to index, and how, is an important design decision because
these keys determine how effectively a document is retrieved from the repository.
Often, adding metadata to the index, such as the URL, tags, or author of the page,
can improve the performance of a search. Simply treating source code snippets as
text documents is not sufficient, for a number of reasons. Some terms in source code,
such as identifiers, are structurally significant. Also, source code typically
contains few keywords that describe what the code does. Consequently, additional
processing is needed to ensure that the index contains the appropriate information,
so that the most relevant code snippets are returned in response to a user's query.
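The plain-text baseline described above, indexing every term except stop words, can be sketched as a small inverted index. This is a minimal illustration, not the system's implementation: the stop-word list, the tokenizer, and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each non-stop-word term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return index

# Made-up documents, for illustration only.
documents = {
    1: "Reading a file in Java",
    2: "Writing to the file system",
}
index = build_index(documents)
```

Looking up `index["file"]` returns both document ids, while stop words such as "the" never become keys. An index of this kind treats a code snippet as a bag of words, which is why it cannot tell a class name from a word in a comment.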
Our index contains metadata from three different sources: web page, code snippet,
and text, as shown in Table 14.2. For web pages, we included two metadata fields:
URL and page title. For code snippets, we included 11 metadata fields: 10 for
different identifier types and 1 for a summary of keywords found in a specific code
snippet. For text, we included one metadata field that has the summary of keywords
found in the text segment associated with a code snippet.
Web page      Code snippet                        Text
URL           Keywords from code snippet          Keywords from text
Page title    Package
              Import
              Class declaration
              Class used
              Extending and implementing class
              Return type
              Method declaration
              Method invocation
              Variable declaration
              Comments
Table 14.2: List of indexable metadata
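To make the identifier fields in Table 14.2 concrete, the sketch below pulls a few of them out of a Java snippet. The text does not say how the identifiers were extracted, so the regular expressions here are a crude stand-in chosen for brevity (a parser would be more reliable), and the field names and sample snippet are hypothetical.

```python
import re

# Field names loosely follow Table 14.2; the patterns are illustrative
# assumptions, not the extraction method used for the actual index.
FIELD_PATTERNS = {
    "package": r"^\s*package\s+([\w.]+)\s*;",
    "import": r"^\s*import\s+([\w.*]+)\s*;",
    "class_declaration": r"\bclass\s+(\w+)",
    "extends_implements": r"\b(?:extends|implements)\s+(\w+)",
    "method_invocation": r"\.(\w+)\s*\(",
}

def extract_fields(snippet):
    """Return a dict mapping each field name to the identifiers found."""
    return {
        field: re.findall(pattern, snippet, flags=re.MULTILINE)
        for field, pattern in FIELD_PATTERNS.items()
    }

# Made-up Java snippet, for illustration only.
snippet = """package demo.io;
import java.io.BufferedReader;
import java.io.FileReader;

public class LineCounter extends Object {
    int count(String path) throws Exception {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        int lines = 0;
        while (reader.readLine() != null) lines++;
        reader.close();
        return lines;
    }
}
"""
fields = extract_fields(snippet)
```

Each key of `fields` corresponds to one metadata field, so a query can later be matched against, say, only the imported classes or only the invoked methods.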
Information for all the metadata fields was indexed and stored in separate
columns in Lucene. We indexed words from 43,306 snippets, which were compressed
into indexes in Lucene with a total size of around 71 MB.
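The idea of keeping each metadata field in its own column can be illustrated with a toy in-memory index that holds a separate term-to-document map per field. This is only a sketch of the concept and does not use or imitate the Lucene API; the class, its methods, and the field names are hypothetical.

```python
from collections import defaultdict

class FieldedIndex:
    """Toy in-memory index keeping a separate term -> ids map per field."""

    def __init__(self):
        self.fields = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, record):
        """Index a record given as {field name: list of terms}."""
        for field, terms in record.items():
            for term in terms:
                self.fields[field][term.lower()].add(doc_id)

    def search(self, field, term):
        """Return the ids of documents whose given field contains the term."""
        return set(self.fields.get(field, {}).get(term.lower(), set()))

# Made-up records, for illustration only.
index = FieldedIndex()
index.add(1, {"class_used": ["BufferedReader"], "text_keywords": ["read", "file"]})
index.add(2, {"method_invocation": ["read"], "text_keywords": ["socket"]})
```

Because the fields are stored separately, the term "read" matches snippet 1 when searched in `text_keywords` and snippet 2 when searched in `method_invocation`, a distinction that a single flat text index would lose.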
 