14.1 Introduction
Searching for source code on the web has become an integral part of software development. We find evidence of this in the creation of search engines specifically designed to search for source code, such as Strathcona [5], Mica [12], Krugle,¹ Koders,² Google Code Search,³ and Sourcerer [6]. All of these tools take the approach of gathering together as much source code as possible from open source hosting sites and making the repository searchable. Unfortunately, these repositories omit the large number of code snippets that are embedded in web pages throughout the Internet. Snippets usually consist of a handful of lines of code and do not necessarily compile. Since snippets differ from components in a number of ways, it stands to reason that they require a different kind of repository and search engine.
In this chapter, we describe "Juicy," a search engine for snippets of Java code, and the lessons learned from its implementation. In the design of Juicy, we treated code snippets as first-class objects. When the search engine returns a page of results, each item consists of an excerpt of the code snippet, a link to the originating web page, and a brief text description. In implementing Juicy, we used many existing tools and algorithms. Our contribution is in the novel application of these resources and the resulting assemblage.
We used the Rotation Forest machine learning algorithm, as implemented by Weka 3, to help us label sections of web pages from Java tutorial sites as either text or source code. The open source project Lucene was used as the repository for Juicy. We used the Porter stemming algorithm to normalize words. The Eclipse AST parser was used to parse the code snippets. Finally, Latent Dirichlet Allocation was used to find the most relevant paragraph of text to be used as a short summary of the snippet.
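To make the text-versus-code labeling step concrete, the following is a minimal sketch, in plain Java and without Weka, of the kind of lexical features such a classifier might compute over a page segment. The specific features (code-like line endings, Java keyword hits) and the decision thresholds are illustrative assumptions of this sketch, not the actual feature set or the trained Rotation Forest model described in the chapter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative feature extractor for labeling a tutorial-page segment as
// "code" or "text". The features and thresholds below are assumptions for
// this sketch, not the chapter's actual Weka feature set.
public class SegmentFeatures {
    private static final Pattern KEYWORDS =
        Pattern.compile("\\b(public|class|void|int|new|return|import)\\b");

    // Fraction of lines ending with ';', '{' or '}' -- a crude code signal.
    public static double codeLineRatio(String segment) {
        String[] lines = segment.split("\n");
        int codeLike = 0;
        for (String line : lines) {
            String t = line.trim();
            if (t.endsWith(";") || t.endsWith("{") || t.endsWith("}")) {
                codeLike++;
            }
        }
        return lines.length == 0 ? 0.0 : (double) codeLike / lines.length;
    }

    // Number of Java keyword occurrences in the segment.
    public static int keywordCount(String segment) {
        Matcher m = KEYWORDS.matcher(segment);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    // Toy decision rule standing in for a trained classifier.
    public static boolean looksLikeCode(String segment) {
        return codeLineRatio(segment) > 0.5 || keywordCount(segment) >= 3;
    }

    public static void main(String[] args) {
        String code = "public class Foo {\n  int x = 1;\n}";
        String text = "This tutorial explains how loops work in Java.";
        System.out.println(looksLikeCode(code));  // code-like segment
        System.out.println(looksLikeCode(text));  // prose segment
    }
}
```

In practice these feature values would be fed to the trained model rather than a hand-written rule; the sketch only shows the shape of the feature-extraction step.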
In addition to leveraging these existing algorithms, we performed some small
empirical investigations to inform our design decisions. We identified appropriate
features to be used in classifying segments of tutorial pages. We found that it was
necessary to filter out many duplicate pages and pages that did not contain code
from our initial crawl of Java language tutorial sites. We found that the best text to
use as metadata for a snippet is the text segment that appears above the snippet. We
found that the best results for a general search were obtained by using only three
indexes: web page title, code snippet, and text segment.
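As a rough illustration of what searching over those three fields looks like, the following is a self-contained, in-memory stand-in for the index; the field names and the naive term-overlap scoring are assumptions of this sketch, not Juicy's actual Lucene-based implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory stand-in for a snippet index over the three fields the
// chapter found most useful: page title, code snippet, and the text segment
// above the snippet. Scoring here is naive term overlap, purely for
// illustration; Juicy itself used Lucene.
public class SnippetIndex {
    private final List<String> titles = new ArrayList<>();
    private final List<Map<String, String>> fields = new ArrayList<>();

    public void add(String title, String snippet, String textSegment) {
        titles.add(title);
        Map<String, String> f = new HashMap<>();
        f.put("title", title.toLowerCase());
        f.put("snippet", snippet.toLowerCase());
        f.put("text", textSegment.toLowerCase());
        fields.add(f);
    }

    // Score each document by how many query terms occur in each of the
    // three fields; return titles of matching documents, best first.
    public List<String> search(String query) {
        String[] terms = query.toLowerCase().split("\\s+");
        List<int[]> scored = new ArrayList<>(); // {docId, score}
        for (int i = 0; i < fields.size(); i++) {
            int score = 0;
            for (String term : terms) {
                for (String value : fields.get(i).values()) {
                    if (value.contains(term)) score++;
                }
            }
            if (score > 0) scored.add(new int[]{i, score});
        }
        scored.sort((a, b) -> b[1] - a[1]);
        List<String> results = new ArrayList<>();
        for (int[] e : scored) results.add(titles.get(e[0]));
        return results;
    }
}
```

A real implementation would delegate tokenization, stemming, and ranking to Lucene's analyzers and scoring; the sketch only shows why restricting the search to these three fields keeps the index small and the matching focused.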
In the remainder of this chapter, we will describe how we used these existing
algorithms and the design decisions that we made in doing so.
¹ http://www.krugle.com/
² http://www.koders.com
³ http://www.google.com/codesearch