Fig. 14.3 Location of relevant text around code snippets
Each text segment that we collected was parsed using simple word delimiters
(e.g. white space, new line) in order to extract all words from the text segment.
Because many extracted words are very common and not very helpful for
searching (e.g. 'a', 'an', 'the'), these words are removed from the collection
of extracted words. We use a list of stop words 6 to filter them out. The remaining
words are changed to lower case and stemmed using the Porter Stemming Algorithm
[ 9 ]. By ignoring capitalization and reducing each word to its simplest form, we
increase the chances of words being matched with the terms in a user's query.
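The steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the stop-word list is a tiny sample rather than the published list, the punctuation trimming is our own addition, and `crude_stem` is a hypothetical suffix stripper standing in for the Porter Stemming Algorithm, which has many more rules.

```python
import re

# Tiny illustrative sample; the real stop-word list is much longer.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "and"}

# Hypothetical stand-in for the Porter stemmer: strips a few common suffixes.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(segment):
    """Turn a text segment into a list of lower-cased, stemmed index terms."""
    # Split on white space, which also covers new lines.
    words = re.split(r"\s+", segment.strip())
    terms = []
    for word in words:
        # Assumption: trim surrounding punctuation before matching.
        word = word.strip(".,;:!?()[]\"'")
        # Lower-casing first makes the stop-word check case-insensitive.
        lowered = word.lower()
        if not lowered or lowered in STOP_WORDS:
            continue
        terms.append(crude_stem(lowered))
    return terms
```

With this sketch, `index_terms("Sorting the Lists")` and `index_terms("a sorted list")` yield the same terms, which is the effect that lower-casing and stemming are meant to achieve for query matching.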
14.4.2 Indexing Code Snippet Segments
Within an integrated development environment (IDE), programmers often search
for variables, functions, classes, and other programming constructs by name [ 11 ].
Code-specific search engines, such as Krugle, Sourcerer, Google Code Search, and
Koders, also provide this functionality. It stands to reason that a snippet search
engine should provide this functionality as well. Snippets, however, tend not to be
complete or syntactically correct, nor can they be compiled and linked. Parsing out programming
language constructs is only the beginning. Identifiers usually are not plain English
words, but rather are improvised compounds. In addition, comments can be a useful
source of metadata and deserve further analysis.
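To illustrate why compound identifiers need extra handling, the following sketch breaks an identifier into lower-cased word parts. It is a minimal example under our own assumptions, not the method used in this work: the function name `split_identifier` is hypothetical, and the rules cover only underscore-separated and camel-case names.

```python
import re


def split_identifier(identifier):
    """Split a compound identifier into lower-cased word parts."""
    parts = []
    # First break on underscores, digits, and any other non-letter characters.
    for chunk in re.split(r"[_\W\d]+", identifier):
        # Then break camel case: a run of capitals (an acronym) that is not
        # followed by a lower-case letter, or an optionally capitalized word.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return [part.lower() for part in parts]
```

For example, `split_identifier("parseXMLDocument")` separates the acronym from the surrounding words, so each part could then be matched against the terms in a query.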
Instead of using fuzzy parsers, such as those used in syntax highlighters,
we tried an approach that has not been used extensively: an incremental compiler.
6 http://www.ranks.nl/resources/stopwords.html