14.2 Approach
Our approach was based on the following key insights:
1. Programming language tutorial pages contain a distinctive combination of
source code snippets and natural language.
2. The natural language on the pages can be used as metadata for the code snippets.
3. Effective searches for snippets need to make use of both the source code and the
natural language text.
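The three insights above can be illustrated with a minimal sketch. It is not Juicy's implementation: the class `SnippetSearchSketch`, the `Snippet` type, and the term-counting score are all hypothetical names and choices made for illustration. The sketch stores each snippet with the surrounding tutorial text as its metadata and ranks snippets by how many query terms appear in the code, in the text, or in both.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: searching snippets by both code and surrounding text. */
final class SnippetSearchSketch {

    /** A code snippet plus the natural-language text used as its metadata. */
    static final class Snippet {
        final String code;
        final String surroundingText;

        Snippet(String code, String surroundingText) {
            this.code = code;
            this.surroundingText = surroundingText;
        }
    }

    /** Splits text into a set of lowercase alphanumeric tokens. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    /** Adds one point per query term found in the code and one per term found in the text. */
    static int score(Snippet s, String query) {
        Set<String> codeTokens = tokens(s.code);
        Set<String> textTokens = tokens(s.surroundingText);
        int score = 0;
        for (String term : tokens(query)) {
            if (codeTokens.contains(term)) {
                score++;
            }
            if (textTokens.contains(term)) {
                score++;
            }
        }
        return score;
    }

    /** Returns the snippets with a non-zero score, best match first. */
    static List<Snippet> search(List<Snippet> repository, String query) {
        List<Snippet> hits = new ArrayList<>();
        for (Snippet s : repository) {
            if (score(s, query) > 0) {
                hits.add(s);
            }
        }
        hits.sort((a, b) -> Integer.compare(score(b, query), score(a, query)));
        return hits;
    }
}
```

A query such as "read file bufferedreader" matches a file-reading snippet through its code (the identifier BufferedReader) and through its description (the words "read" and "file"), which neither field would achieve alone. A production engine would replace the term counting with a proper ranking function.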
The architecture of Juicy is divided into two parts: a back end that works offline
and an interactive front end. These parts are depicted in Fig. 14.1 below.
Fig. 14.1 Architecture of Juicy, a Java code snippet search engine
The back end consists of a repository built on top of Apache Lucene, a text search
engine library written in Java [8]. Our contributions consist of the techniques for
populating the repository with code snippets and for creating metadata and indexes.
The front end provides a web-based user interface to the repository.
14.3 Populating the Repository
We populated the repository by using a web crawler to collect web pages from the
Internet. In populating the repository for a snippet search engine, three issues
need to be considered: (1) which sites to crawl; (2) how to exclude
pages that do not contain Java code; and (3) how to extract code snippets from the web
pages. We discuss each of these in this section.
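To make issues (2) and (3) concrete, the following is a minimal sketch under stated assumptions and should not be read as the technique used in Juicy. It assumes that snippets sit inside `<pre>` elements and that a handful of Java keywords followed by typical punctuation is enough to recognize Java code. The class `PageFilterSketch` and all of its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: extract snippets from a page and filter out non-Java pages. */
final class PageFilterSketch {

    // Assumption: tutorial pages place code inside <pre>...</pre> elements.
    private static final Pattern PRE_BLOCK = Pattern.compile(
            "<pre\\b[^>]*>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Assumption: a Java keyword followed on the same line by ';', '{' or '(' signals Java.
    private static final Pattern JAVA_HINT = Pattern.compile(
            "\\b(public|private|class|import|void|new)\\b[^\\n]*[;{(]");

    /** Returns the text of every non-empty <pre> block, with inner tags removed. */
    static List<String> extractSnippets(String html) {
        List<String> snippets = new ArrayList<>();
        Matcher m = PRE_BLOCK.matcher(html);
        while (m.find()) {
            String withoutTags = m.group(1).replaceAll("<[^>]+>", "");
            String text = unescape(withoutTags).trim();
            if (!text.isEmpty()) {
                snippets.add(text);
            }
        }
        return snippets;
    }

    /** Decodes the few HTML entities that commonly appear in code listings. */
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    /** Crude heuristic check that a snippet resembles Java source. */
    static boolean looksLikeJava(String snippet) {
        return JAVA_HINT.matcher(snippet).find();
    }

    /** A page is kept only if at least one extracted snippet resembles Java. */
    static boolean containsJavaCode(String html) {
        for (String snippet : extractSnippets(html)) {
            if (looksLikeJava(snippet)) {
                return true;
            }
        }
        return false;
    }
}
```

A crawler could call `containsJavaCode` on each fetched page and pass only the pages that qualify to `extractSnippets`. Regular expressions are fragile on real-world HTML, so a real crawler would more plausibly rely on an HTML parser and a stronger language test.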