To evaluate our algorithms, we created a corpus of web pages from results returned
by Google search. We issued 16 queries, each containing the term “java” and one of
the following keywords from the Java programming language: abstract, class, double,
final, for, if, import, int, interface, long, private, protected, public, static,
void, and while. We downloaded and archived the first 50 results of each search.
We removed 52 duplicate pages from the repository, along with 41 pages that did not
contain HTML (e.g., PDF and word-processor documents). Our final corpus contained
707 diverse web pages, both with and without Java source code examples. These pages
contained 471,536 content segments and 9,796 grouped content segments. For each
page, we created by hand a “gold standard,” or oracle, of correct classifications.
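As a minimal illustrative sketch (not the authors' code), the query set described
above could be assembled as follows; the class name is ours, and the downloading
and archiving of the top 50 results per query is only indicated by a comment.

    import java.util.ArrayList;
    import java.util.List;

    public class CorpusQueries {
        public static void main(String[] args) {
            // The 16 Java keywords paired with the term "java" to form the search queries.
            String[] keywords = {
                "abstract", "class", "double", "final", "for", "if", "import", "int",
                "interface", "long", "private", "protected", "public", "static", "void", "while"
            };
            List<String> queries = new ArrayList<>();
            for (String keyword : keywords) {
                queries.add("java " + keyword);      // e.g. "java interface"
            }
            // For each query, the first 50 Google results would then be downloaded
            // and archived; that step is outside the scope of this sketch.
            System.out.println(queries.size() + " queries: " + queries);
        }
    }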
The F1 statistic is the weighted harmonic mean of precision and recall. In our
evaluation, we calculated it separately for both text and source code, but here we
show the generic formula we used to calculate both:
$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (14.1)$$
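A minimal sketch of Eq. 14.1, assuming precision and recall are already available
as fractions; the class and method names are ours, not the authors':

    public class F1Score {
        // Eq. 14.1: weighted harmonic mean of precision and recall.
        public static double f1(double precision, double recall) {
            if (precision + recall == 0.0) {
                return 0.0;                          // degenerate case: report 0 by convention
            }
            return 2.0 * (precision * recall) / (precision + recall);
        }

        public static void main(String[] args) {
            // Example values only; not taken from the evaluation.
            System.out.println(f1(0.95, 0.90));      // prints roughly 0.9243
        }
    }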
Classification accuracy is the percentage of segments whose contents the algorithm
correctly classifies as text or source code.
$$\text{Accuracy} = \frac{\text{number of correctly classified segments}}{\text{total number of segments}} \qquad (14.2)$$
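A matching sketch of Eq. 14.2, assuming each segment carries a gold-standard and a
predicted label stored as the strings "text" or "code"; the class, method, and
label names are ours:

    import java.util.List;

    public class SegmentAccuracy {
        // Eq. 14.2: fraction of segments whose predicted label matches the gold standard.
        public static double accuracy(List<String> gold, List<String> predicted) {
            int correct = 0;
            for (int i = 0; i < gold.size(); i++) {
                if (gold.get(i).equals(predicted.get(i))) {
                    correct++;
                }
            }
            return (double) correct / gold.size();
        }

        public static void main(String[] args) {
            // Example labels only; not taken from the evaluation.
            List<String> gold      = List.of("text", "text", "code", "code", "text");
            List<String> predicted = List.of("text", "code", "code", "code", "text");
            System.out.println(accuracy(gold, predicted)); // prints 0.8
        }
    }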
We found that F1(Text) = 0.968, F1(Code) = 0.767, and Accuracy = 0.959. Training the
algorithm took 14.19 s, and classifying a typical page took 0.147 s.
14.3.3 Summary
After this processing is complete, we have a repository of web pages, each factored
into code snippets and text that can be used as metadata.
The number of pages available after each step is summarized in Table 14.1 .
Table 14.1: Number of pages after filtering

    Total pages downloaded           34,054
    Pages with no Java code          21,162
    Pages with Java code and text    12,892