Novel and Applied Algorithms in a Search Engine for Java Code Snippets - Finding Source Code on the Web for Remix and Reuse


Fig. 14.3 Location of relevant text around code snippets

Each text segment that we collected was parsed using simple word delimiters (e.g., white space, new line) in order to extract all of its words. Many of the extracted words, such as 'a', 'an', and 'the', are very common and not very helpful for searching, so they should be removed from the collection of extracted words. We use a list of stop words 6 to filter them out. The remaining words are changed to lower case and stemmed using the Porter Stemming Algorithm [9]. By ignoring capitalization and reducing each word to its stem, we increase the chances of words being matched with the terms in a user's query.
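The steps described above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name TextSegmentTerms, the tiny STOP_WORDS set, and the naiveStem helper are all hypothetical. In particular, naiveStem is a stand-in that only strips a few suffixes; it is not the Porter Stemming Algorithm, and a real Porter implementation could be passed in through the stemmer parameter instead. The sketch also lower-cases each word before the stop-word lookup so that 'The' and 'the' are treated alike, and it trims surrounding punctuation, which the text does not mention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

final class TextSegmentTerms {
    // Illustrative stop-word list only; the list used by the authors is not given in the text.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "of", "to", "in", "and", "is");

    // Stand-in stemmer (NOT the Porter algorithm): strips a few common English suffixes.
    static String naiveStem(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Splits a text segment on white space, drops stop words, lower-cases and stems the rest.
    static List<String> extractTerms(String segment, UnaryOperator<String> stemmer) {
        List<String> terms = new ArrayList<>();
        for (String token : segment.split("\\s+")) {
            // Assumption: trim leading and trailing characters that are not letters or digits.
            String word = token
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "")
                    .toLowerCase(Locale.ROOT);
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            terms.add(stemmer.apply(word));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [parser, read, file]
        System.out.println(extractTerms("The parser is reading a file", TextSegmentTerms::naiveStem));
    }
}
```

Passing the stemmer as a function keeps the tokenizing and filtering logic independent of whichever stemming implementation is eventually used.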

14.4.2 Indexing Code Snippet Segments

Within an integrated development environment (IDE), programmers often search for variables, functions, classes, and other programming constructs by name [11]. Code-specific search engines, such as Krugle, Sourcerer, Google Code Search, and Koders, also provide this functionality. It stands to reason that a snippet search engine should provide this functionality as well. Snippets, however, tend not to be complete or syntactically correct, nor can they be compiled and linked. Parsing out programming language constructs is only the beginning. Identifiers usually are not plain English words, but rather are improvised compounds. In addition, comments can be a useful source of metadata and deserve further analysis.
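To illustrate what "improvised compounds" means for indexing, the sketch below breaks an identifier into lower-case words that could be matched against query terms. Splitting on camel-case and underscore boundaries is a common convention assumed here for illustration; the text above does not state how the authors decompose identifiers, and the class name IdentifierSplitter is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class IdentifierSplitter {
    // Splits an identifier such as readFileToString or MAX_BUFFER_SIZE into lower-case words.
    static List<String> split(String identifier) {
        // Boundaries: runs of '_' or '$', a lower-case letter or digit followed by an
        // upper-case letter, and the end of an acronym (e.g. between "XML" and "Http").
        String[] pieces = identifier.split(
                "[_$]+|(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");
        List<String> words = new ArrayList<>();
        for (String piece : pieces) {
            if (!piece.isEmpty()) {
                words.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("readFileToString"));
        System.out.println(split("XMLHttpRequest"));
        System.out.println(split("MAX_BUFFER_SIZE"));
    }
}
```

The resulting words could then be passed through the same lower-casing and stemming steps applied to the surrounding text, so that a query for "read file" has a chance of matching readFileToString.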

Instead of using fuzzy parsers, such as those used in syntax highlighters, we tried an approach that has not been used extensively: an incremental compiler.

