14.1 Introduction
Searching for source code on the web has become an integral part of software development. We find evidence of this in the creation of search engines specifically designed to search for source code, such as Strathcona [5], Mica [12], Krugle,¹ Koders,² Google Code Search,³ and Sourcerer [6]. All of these tools take the approach of gathering together as much source code as possible from open source hosting sites and making the repository searchable. Unfortunately, these repositories omit the large number of code snippets that are embedded in web pages throughout the Internet. Snippets usually consist of a handful of lines of code and do not necessarily compile. Since snippets differ from components in a number of ways, it stands to reason that they require a different kind of repository and search engine.
In this chapter, we describe "Juicy," a search engine for snippets of Java code, and the lessons learned from its implementation. In the design of Juicy, we treated code snippets as first-class objects. When the search engine returns a page of results, each item consists of an excerpt of the code snippet, a link to the originating web page, and a brief text description. In implementing Juicy, we used many existing tools and algorithms. Our contribution is in the novel application of these resources and the resulting assemblage.
We used the Rotation Forest machine learning algorithm, as implemented by Weka 3, to help us label sections of web pages from Java tutorial sites as either text or source code. The open source project Lucene was used as the repository for Juicy. We used the Porter stemming algorithm to normalize words. The Eclipse AST parser was used to parse the code snippets. Finally, Latent Dirichlet Allocation was used to find the most relevant paragraph of text to be used as a short summary of the snippet.
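To make the text-versus-code labeling step concrete, the following is a minimal sketch, in plain Java and without Weka, of the kind of lexical features such a classifier might compute over a page segment. The specific features (code-like line endings, Java keyword hits) and the decision thresholds are illustrative assumptions of this sketch, not the actual feature set or the trained Rotation Forest model described in the chapter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative feature extractor for labeling a tutorial-page segment as
// "code" or "text". The features and thresholds below are assumptions for
// this sketch, not the chapter's actual Weka feature set.
public class SegmentFeatures {
    private static final Pattern KEYWORDS =
        Pattern.compile("\\b(public|class|void|int|new|return|import)\\b");

    // Fraction of lines ending with ';', '{' or '}' -- a crude code signal.
    public static double codeLineRatio(String segment) {
        String[] lines = segment.split("\n");
        int codeLike = 0;
        for (String line : lines) {
            String t = line.trim();
            if (t.endsWith(";") || t.endsWith("{") || t.endsWith("}")) {
                codeLike++;
            }
        }
        return lines.length == 0 ? 0.0 : (double) codeLike / lines.length;
    }

    // Number of Java keyword occurrences in the segment.
    public static int keywordCount(String segment) {
        Matcher m = KEYWORDS.matcher(segment);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    // Toy decision rule standing in for a trained classifier.
    public static boolean looksLikeCode(String segment) {
        return codeLineRatio(segment) > 0.5 || keywordCount(segment) >= 3;
    }

    public static void main(String[] args) {
        String code = "public class Foo {\n  int x = 1;\n}";
        String text = "This tutorial explains how loops work in Java.";
        System.out.println(looksLikeCode(code));  // code-like segment
        System.out.println(looksLikeCode(text));  // prose segment
    }
}
```

In practice these feature values would be fed to the trained model rather than a hand-written rule; the sketch only shows the shape of the feature-extraction step.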
In addition to leveraging these existing algorithms, we performed some small
empirical investigations to inform our design decisions. We identified appropriate
features to be used in classifying segments of tutorial pages. We found that it was
necessary to filter out many duplicate pages and pages that did not contain code
from our initial crawl of Java language tutorial sites. We found that the best text to
use as metadata for a snippet is the text segment that appears above the snippet. We
found that the best results for a general search were obtained by using only three
indexes: web page title, code snippet, and text segment.
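As a rough illustration of what searching over those three fields looks like, the following is a self-contained, in-memory stand-in for the index; the field names and the naive term-overlap scoring are assumptions of this sketch, not Juicy's actual Lucene-based implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory stand-in for a snippet index over the three fields the
// chapter found most useful: page title, code snippet, and the text segment
// above the snippet. Scoring here is naive term overlap, purely for
// illustration; Juicy itself used Lucene.
public class SnippetIndex {
    private final List<String> titles = new ArrayList<>();
    private final List<Map<String, String>> fields = new ArrayList<>();

    public void add(String title, String snippet, String textSegment) {
        titles.add(title);
        Map<String, String> f = new HashMap<>();
        f.put("title", title.toLowerCase());
        f.put("snippet", snippet.toLowerCase());
        f.put("text", textSegment.toLowerCase());
        fields.add(f);
    }

    // Score each document by how many query terms occur in each of the
    // three fields; return titles of matching documents, best first.
    public List<String> search(String query) {
        String[] terms = query.toLowerCase().split("\\s+");
        List<int[]> scored = new ArrayList<>(); // {docId, score}
        for (int i = 0; i < fields.size(); i++) {
            int score = 0;
            for (String term : terms) {
                for (String value : fields.get(i).values()) {
                    if (value.contains(term)) score++;
                }
            }
            if (score > 0) scored.add(new int[]{i, score});
        }
        scored.sort((a, b) -> b[1] - a[1]);
        List<String> results = new ArrayList<>();
        for (int[] e : scored) results.add(titles.get(e[0]));
        return results;
    }
}
```

A real implementation would delegate tokenization, stemming, and ranking to Lucene's analyzers and scoring; the sketch only shows why restricting the search to these three fields keeps the index small and the matching focused.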
In the remainder of this chapter, we will describe how we used these existing
algorithms and the design decisions that we made in doing so.
¹ http://www.krugle.com/
² http://www.koders.com
³ http://www.google.com/codesearch