Novel and Applied Algorithms in a Search Engine for Java Code Snippets - Finding Source Code on the Web for Remix and Reuse - page 275

14.4 Indexing the Repository

Indexing is the process of identifying a set of keys for looking up a document in a repository. With text documents, it is common to index all of the terms, excluding stop words. Choosing what to index and how is an important design decision, because these keys determine how effectively a document is retrieved from the repository. Often, adding metadata to the index, such as the URL, tags, or author of the page, can improve the performance of a search. Simply treating source code snippets as text documents is not sufficient for a number of reasons. Some terms in source code are structurally significant, such as identifiers. Also, source code typically contains few keywords that tell you about what the code does. Consequently, additional processing is needed to ensure that the index contains the appropriate information, so that the most relevant code snippets are returned in response to a user's query.
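The text-document baseline mentioned above, indexing every term except stop words, can be sketched as a small inverted index. This is only an illustration under stated assumptions, not the authors' implementation: the class name TermIndex, the tiny stop-word list, and the sample documents are all hypothetical.

```java
import java.util.*;

// Minimal sketch of term indexing with stop-word removal.
// Illustrative only; not the indexing code used by the authors.
class TermIndex {
    // Hypothetical, deliberately tiny stop-word list; real lists are much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "to", "in", "is", "and"));

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    void add(int docId, String text) {
        // Lowercase, then split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // skip empty tokens and stop words
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TermIndex index = new TermIndex();
        index.add(1, "Reading a file in Java");
        index.add(2, "Sorting the list of files");
        System.out.println(index.lookup("file")); // ids of documents containing "file"
        System.out.println(index.lookup("the"));  // stop word, so nothing was indexed
    }
}
```

Because the sketch does no stemming, "file" and "files" are distinct keys. It also treats every token alike, which is exactly the limitation the paragraph above raises for source code: an identifier such as a class name carries no more weight here than any other word.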

Our index contains metadata from three different sources: web page, code snippet, and text, as shown in Table 14.2. For web pages, we included two metadata fields: URL and page title. For code snippets, we included 11 metadata fields: 10 for different identifier types and 1 for a summary of keywords found in a specific code snippet. For text, we included one metadata field that has the summary of keywords found in the text segment associated with a code snippet.

Web page      Code snippet                        Text
----------    --------------------------------    ------------------
URL           Keywords from code snippet          Keywords from text
Page title    Package
              Import
              Class declaration
              Class used
              Extending and implementing class
              Return type
              Method declaration
              Method invocation
              Variable declaration
              Comments

Table 14.2: List of indexable metadata
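To make the idea of per-field indexing concrete, the following is a rough sketch in which each metadata field keeps its own term-to-snippet map, so a query can target a single field. It is an assumption-laden illustration, not the authors' method: the class FieldedSnippetIndex, the field names, the example URL, and the two regular expressions are hypothetical, and only four of the fields from Table 14.2 are covered. A real system would more plausibly extract identifiers with a Java parser than with regular expressions.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of a fielded index: each metadata field has its own term -> snippet ids map.
// Illustrative only; field names and extraction rules are assumptions.
class FieldedSnippetIndex {
    // field name -> (term -> snippet ids)
    private final Map<String, Map<String, Set<Integer>>> fields = new HashMap<>();

    // Very rough patterns for two identifier types; they miss many valid Java forms.
    private static final Pattern IMPORT = Pattern.compile("\\bimport\\s+([\\w.]+)\\s*;");
    private static final Pattern CLASS_DECL = Pattern.compile("\\bclass\\s+(\\w+)");

    void addTerm(String field, String term, int snippetId) {
        fields.computeIfAbsent(field, f -> new HashMap<>())
              .computeIfAbsent(term.toLowerCase(Locale.ROOT), t -> new TreeSet<>())
              .add(snippetId);
    }

    void addSnippet(int snippetId, String url, String pageTitle, String code) {
        addTerm("url", url, snippetId); // the whole URL is stored as a single key
        for (String word : pageTitle.split("\\s+")) {
            if (!word.isEmpty()) {
                addTerm("page_title", word, snippetId);
            }
        }
        addMatches("import", IMPORT, code, snippetId);
        addMatches("class_declaration", CLASS_DECL, code, snippetId);
    }

    private void addMatches(String field, Pattern pattern, String code, int snippetId) {
        Matcher m = pattern.matcher(code);
        while (m.find()) {
            addTerm(field, m.group(1), snippetId);
        }
    }

    Set<Integer> search(String field, String term) {
        return fields.getOrDefault(field, Collections.emptyMap())
                     .getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        FieldedSnippetIndex index = new FieldedSnippetIndex();
        index.addSnippet(1, "https://example.com/io", "Reading files in Java",
                "import java.io.File;\npublic class Reader { File f; }");
        System.out.println(index.search("class_declaration", "Reader")); // snippet declares Reader
        System.out.println(index.search("page_title", "Reader"));        // not in the page title
    }
}
```

Keeping the fields separate means a term found in a class declaration can be matched or weighted differently from the same term found in a page title, which is the motivation for splitting the metadata as in Table 14.2.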

Information for all the metadata fields was indexed and stored in separate columns in Lucene. We indexed words from 43,306 snippets, which were compressed into Lucene indexes with a total size of around 71 MB.
