Novel and Applied Algorithms in a Search Engine for Java Code Snippets - Finding Source Code on the Web for Remix and Reuse


14.2 Approach

Our approach was based on the following key insights:

1. Programming language tutorial pages contain a distinctive combination of source code snippets and natural language.

2. The natural language on the pages can be used as metadata for the code snippets.

3. Effective searches for snippets need to make use of both the source code and the natural language text.
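To make the second and third insights concrete, the following is a minimal sketch of a snippet record that carries its surrounding prose as metadata, together with a search that scores a query against both the code and the prose. It is an illustration only: the class `SnippetSearchSketch`, the `Snippet` fields, the tokenizer, and the term-counting score are all hypothetical and are not taken from Juicy or from the Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: not Juicy's implementation and not the Lucene API.
final class SnippetSearchSketch {

    /** A code snippet paired with the tutorial prose around it, which serves as metadata. */
    static final class Snippet {
        final String code;
        final String surroundingText;

        Snippet(String code, String surroundingText) {
            this.code = code;
            this.surroundingText = surroundingText;
        }
    }

    /** Lower-cases the input and splits it on runs of non-alphanumeric characters. */
    static Set<String> tokenize(String s) {
        Set<String> tokens = new HashSet<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Assumed scoring rule: one point per query term found in the code, one per term found in the text. */
    static int score(Snippet s, String query) {
        Set<String> codeTokens = tokenize(s.code);
        Set<String> textTokens = tokenize(s.surroundingText);
        int score = 0;
        for (String term : tokenize(query)) {
            if (codeTokens.contains(term)) score++;
            if (textTokens.contains(term)) score++;
        }
        return score;
    }

    /** Returns the snippets with a non-zero score, highest score first. */
    static List<Snippet> search(List<Snippet> repo, String query) {
        List<Snippet> hits = new ArrayList<>();
        for (Snippet s : repo) {
            if (score(s, query) > 0) hits.add(s);
        }
        hits.sort((a, b) -> Integer.compare(score(b, query), score(a, query)));
        return hits;
    }

    public static void main(String[] args) {
        List<Snippet> repo = Arrays.asList(
            new Snippet("BufferedReader in = new BufferedReader(new FileReader(path));",
                        "This example shows how to read a text file line by line."),
            new Snippet("String s = in.readLine();",
                        "Read one line from the console."));
        for (Snippet hit : search(repo, "read file BufferedReader")) {
            System.out.println(hit.code);
        }
    }
}
```

In this sketch the query terms "read" and "file" match only the prose of the first snippet, while "BufferedReader" matches only its code, so neither field alone would rank it as highly as the two combined.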

The architecture of Juicy is divided into two parts: a back end that works offline and an interactive front end. These parts are depicted in Fig. 14.1 below.

Fig. 14.1 Architecture of Juicy, a Java code snippet search engine

The back end consists of a repository built on top of Apache Lucene, a text search engine library written in Java [8]. Our contributions consist of the techniques for populating the repository with code snippets and for creating metadata and indexes. The front end provides a web-based user interface to the repository.

14.3 Populating the Repository

We populated the repository by using a web crawler to collect web pages from the Internet. Populating the repository for a snippet search engine raises three issues: (1) which sites to crawl; (2) how to exclude pages that do not contain Java code; and (3) how to extract code snippets from each web page. We discuss each of these in this section.
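As a rough illustration of issues (2) and (3), the sketch below pulls candidate snippets out of a page and discards pages that yield none. It rests on assumptions that do not come from the text: that snippets sit inside `<pre>` elements, and that a crude keyword-plus-punctuation pattern is enough to guess whether a block is Java. The class and method names are hypothetical, and a real crawler would use a proper HTML parser rather than regular expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of page filtering and snippet extraction; not Juicy's actual technique.
final class SnippetExtractionSketch {

    // Assumption: tutorial pages place their code inside <pre>...</pre> elements.
    private static final Pattern PRE_BLOCK =
        Pattern.compile("<pre\\b[^>]*>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Assumption: a line with a common Java keyword followed by ';' or '{' suggests Java code.
    private static final Pattern JAVA_HINT =
        Pattern.compile("\\b(class|import|public|void|new)\\b[^\\n]*[;{]");

    static boolean looksLikeJava(String text) {
        return JAVA_HINT.matcher(text).find();
    }

    /** Strips nested tags, then decodes the few HTML entities that commonly appear in code. */
    static String unescape(String s) {
        return s.replaceAll("<[^>]+>", "")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&"); // decoded last so "&amp;lt;" is not unescaped twice
    }

    /** Issue (3): returns the text of every <pre> block that looks like Java. */
    static List<String> extractSnippets(String html) {
        List<String> snippets = new ArrayList<>();
        Matcher m = PRE_BLOCK.matcher(html);
        while (m.find()) {
            String candidate = unescape(m.group(1)).trim();
            if (looksLikeJava(candidate)) {
                snippets.add(candidate);
            }
        }
        return snippets;
    }

    /** Issue (2): a page is kept only if at least one Java-looking snippet was found. */
    static boolean containsJavaCode(String html) {
        return !extractSnippets(html).isEmpty();
    }

    public static void main(String[] args) {
        String page = "<p>Create a list:</p>"
            + "<pre>List&lt;String&gt; names = new ArrayList&lt;&gt;();</pre>"
            + "<pre>$ ls -la</pre>";
        for (String snippet : extractSnippets(page)) {
            System.out.println(snippet);
        }
    }
}
```

The keyword heuristic will misclassify some blocks (for example, other C-like languages), so it should be read as a placeholder for whatever page filter is actually used.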
