We also extracted information about source code repositories during the crawl,
which allowed us to build a large list of CVS and SVN repositories for the source
code crawler.
6.3.3 Web Page Searching
Finally, we used Nutch's search support (built on top of Lucene) to make these
web pages searchable. The actual indexes were stored on multiple page searchers,
since (at that time) a typical 4-core box with 8 GB of RAM could comfortably
handle 10-20 M pages, and our index was bigger than that. Nutch provided the support
to distribute a search request to multiple searchers, each with a slice (“shard”) of the
index, then combine the results.
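To make that fan-out concrete, the sketch below shows the general scatter-gather pattern: the query is sent to each shard searcher in parallel, and the per-shard hit lists are merged into a single ranked result. The ShardSearcher interface and SearchHit record are hypothetical stand-ins for illustration; they are not Nutch's actual distributed search classes.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardedSearch {

    // One hit from a single shard: document id plus relevance score.
    record SearchHit(String docId, float score) {}

    // A searcher that holds one slice ("shard") of the full index.
    interface ShardSearcher {
        List<SearchHit> search(String query, int topN) throws Exception;
    }

    // Send the query to every shard in parallel, then merge the per-shard
    // top-N lists into a single global top-N ordered by score.
    static List<SearchHit> search(List<ShardSearcher> shards, String query, int topN)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<List<SearchHit>>> futures = new ArrayList<>();
            for (ShardSearcher shard : shards) {
                futures.add(pool.submit(() -> shard.search(query, topN)));
            }
            List<SearchHit> merged = new ArrayList<>();
            for (Future<List<SearchHit>> f : futures) {
                merged.addAll(f.get());
            }
            merged.sort(Comparator.comparingDouble(SearchHit::score).reversed());
            return merged.subList(0, Math.min(topN, merged.size()));
        } finally {
            pool.shutdown();
        }
    }
}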
6.3.4 Source Code Crawling
The source code crawler was also based on Nutch. We added “protocol handlers”
for CVS and SVN, which let us leverage the distributed fetching and parsing support
that was built into Nutch.
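As a rough illustration of the protocol-handler idea, the fetch loop looks up a handler by URL scheme, so adding CVS and SVN support amounts to registering handlers for those schemes. The interface and registry below are hypothetical sketches, not Nutch's plugin API.

import java.util.HashMap;
import java.util.Map;

public class ProtocolHandlers {

    // Fetched bytes plus a content type, analogous to a fetched web page.
    record Content(String url, byte[] data, String contentType) {}

    // One handler per URL scheme (http, svn, CVS pserver, ...).
    interface ProtocolHandler {
        Content fetch(String url) throws Exception;
    }

    private final Map<String, ProtocolHandler> handlersByScheme = new HashMap<>();

    void register(String scheme, ProtocolHandler handler) {
        handlersByScheme.put(scheme, handler);
    }

    // The fetcher calls this for every URL in its fetch list; the same loop
    // then works unchanged whether the URL points at a web page or a repository.
    Content fetch(String url) throws Exception {
        String scheme = url.substring(0, url.indexOf(':'));
        ProtocolHandler handler = handlersByScheme.get(scheme);
        if (handler == null) {
            throw new IllegalArgumentException("No handler registered for scheme: " + scheme);
        }
        return handler.fetch(url);
    }
}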
The “crawlDB” for the source code crawler contained HTTP-based URLs to
SVN and CVS repositories. Many of these URLs were entered by hand after manual
searching, but we also included URLs that were discovered during the web page
crawl, as described previously. Finally, the project processing code
(see below) also provided us with repository information.
One of the challenges we ran into was deciding whether to only get the trunk of a
project's code, or some number of the tags and branches as well. Initially we just
went after the trunk, but eventually we settled on logic that would fetch the trunk,
plus the “most interesting” tags, which we defined as being the latest point release
for each major and minor version. Thus we'd go for the 1.0.4, 2.0 and 2.1.3 tags, but
not 1.0.1, 1.0.2, 1.0.3, 2.1.0, etc.
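A small sketch of that tag-selection rule follows, assuming tag names are plain dotted version strings; real repositories need messier pattern matching (e.g. "release-1_0_4" or "v2.1.3").

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InterestingTags {

    private record Candidate(int patch, String tag) {}

    // Keep only the latest point release for each major.minor pair.
    static List<String> selectInterestingTags(Collection<String> tags) {
        Map<String, Candidate> latestPerMinor = new TreeMap<>();  // "1.0" -> best tag
        for (String tag : tags) {
            String[] parts = tag.split("\\.");
            if (parts.length < 2) continue;                 // skip unparseable tags
            String minorKey = parts[0] + "." + parts[1];
            int patch = parts.length > 2 ? Integer.parseInt(parts[2]) : 0;
            Candidate best = latestPerMinor.get(minorKey);
            if (best == null || patch > best.patch()) {
                latestPerMinor.put(minorKey, new Candidate(patch, tag));
            }
        }
        List<String> selected = new ArrayList<>();
        for (Candidate best : latestPerMinor.values()) {
            selected.add(best.tag());
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> tags = List.of("1.0.1", "1.0.2", "1.0.3", "1.0.4",
                                    "2.0", "2.1.0", "2.1.3");
        // Prints [1.0.4, 2.0, 2.1.3], matching the example in the text.
        System.out.println(selectInterestingTags(tags));
    }
}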
We also found out quickly that we needed to be careful about monitoring and
constraining the load that we put on CVS and SVN repositories. Due to a bug in
the code, we accidentally wound up trying to download the complete Apache.org
SVN repository—the trunk and all branches and tags from every project. This was
crushing their infrastructure, and the ops team at Apache wisely blocked our crawler
IP addresses to prevent a meltdown. After some groveling and negotiations, plus
more unit tests, we were unblocked and could resume the crawl at a more reasonable
rate.
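A minimal sketch of the kind of per-host throttling this implies, assuming a fixed minimum delay between successive requests to the same repository host; the class and its delay parameter are illustrative, not the crawler's actual policy.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HostThrottle {

    // Per-host state: timestamp of the most recent request to that host.
    private static final class HostState {
        long lastRequestMillis;
    }

    private final long minDelayMillis;
    private final Map<String, HostState> hosts = new ConcurrentHashMap<>();

    HostThrottle(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Call before each fetch. Holding the per-host lock while sleeping
    // serializes requests to the same host and spaces them out by at
    // least minDelayMillis; requests to other hosts are unaffected.
    void acquire(String host) throws InterruptedException {
        HostState state = hosts.computeIfAbsent(host, h -> new HostState());
        synchronized (state) {
            long wait = state.lastRequestMillis + minDelayMillis - System.currentTimeMillis();
            if (wait > 0) {
                Thread.sleep(wait);
            }
            state.lastRequestMillis = System.currentTimeMillis();
        }
    }
}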
For some of the larger repositories, we looked into mirroring them, and eventually
did set up an rsync of several. Unfortunately, we were never able to negotiate a
mirroring agreement with SourceForge, which was the biggest single repository that
we needed to crawl. And the total number of unique repositories (more than 100)