To evaluate our algorithms, we created a corpus of web pages from results returned
by Google search. We issued 16 queries, each containing the term “java” and one of
the following keywords from the Java programming language: abstract, class, double,
final, for, if, import, int, interface, long, private, protected, public, static,
void, and while. We downloaded and archived the first 50 results of each search.
We removed 52 duplicate pages from the repository, along with 41 pages that did not
contain HTML (e.g., PDF and word-processor documents). Our final corpus contained
707 diverse web pages, both with and without Java source code examples. These pages
contained 471,536 content segments and 9,796 grouped content segments. For each
page, we created by hand a “gold standard,” or oracle, of correct classifications.
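As a minimal illustrative sketch (not the authors' code), the query set described
above could be assembled as follows; the class name is ours, and the downloading
and archiving of the top 50 results per query is only indicated by a comment.

    import java.util.ArrayList;
    import java.util.List;

    public class CorpusQueries {
        public static void main(String[] args) {
            // The 16 Java keywords paired with the term "java" to form the search queries.
            String[] keywords = {
                "abstract", "class", "double", "final", "for", "if", "import", "int",
                "interface", "long", "private", "protected", "public", "static", "void", "while"
            };
            List<String> queries = new ArrayList<>();
            for (String keyword : keywords) {
                queries.add("java " + keyword);      // e.g. "java interface"
            }
            // For each query, the first 50 Google results would then be downloaded
            // and archived; that step is outside the scope of this sketch.
            System.out.println(queries.size() + " queries: " + queries);
        }
    }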
The F1 statistic is the weighted harmonic mean of precision and recall. In our
evaluation, we calculated it separately for both text and source code, but here we
show the generic formula we used to calculate both:
$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (14.1)$$
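A minimal sketch of Eq. 14.1, assuming precision and recall are already available
as fractions; the class and method names are ours, not the authors':

    public class F1Score {
        // Eq. 14.1: weighted harmonic mean of precision and recall.
        public static double f1(double precision, double recall) {
            if (precision + recall == 0.0) {
                return 0.0;                          // degenerate case: report 0 by convention
            }
            return 2.0 * (precision * recall) / (precision + recall);
        }

        public static void main(String[] args) {
            // Example values only; not taken from the evaluation.
            System.out.println(f1(0.95, 0.90));      // prints roughly 0.9243
        }
    }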
Classification accuracy is the percentage of segments whose contents the algorithm
correctly classifies as text or source code.
$$\text{Accuracy} = \frac{\text{number of correctly classified segments}}{\text{total number of segments}} \qquad (14.2)$$
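A matching sketch of Eq. 14.2, assuming each segment carries a gold-standard and a
predicted label stored as the strings "text" or "code"; the class, method, and
label names are ours:

    import java.util.List;

    public class SegmentAccuracy {
        // Eq. 14.2: fraction of segments whose predicted label matches the gold standard.
        public static double accuracy(List<String> gold, List<String> predicted) {
            int correct = 0;
            for (int i = 0; i < gold.size(); i++) {
                if (gold.get(i).equals(predicted.get(i))) {
                    correct++;
                }
            }
            return (double) correct / gold.size();
        }

        public static void main(String[] args) {
            // Example labels only; not taken from the evaluation.
            List<String> gold      = List.of("text", "text", "code", "code", "text");
            List<String> predicted = List.of("text", "code", "code", "code", "text");
            System.out.println(accuracy(gold, predicted)); // prints 0.8
        }
    }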
We found that F1(Text) = 0.968, F1(Code) = 0.767, and Accuracy = 0.959. Training the
algorithm took 14.19 s, and classifying a typical page took 0.147 s.
14.3.3 Summary
After this processing is complete, we have a repository of web pages, each factored
into code snippets and text that can be used as metadata.
The number of pages available after each step is summarized in Table 14.1 .
Table 14.1: Number of pages after filtering

    Total pages downloaded           34,054
    Pages with no Java code          21,162
    Pages with Java code and text    12,892