We also extracted information about source code repositories during the crawl,
which allowed us to build a large list of CVS and SVN repositories for the source
code crawler.
6.3.3 Web Page Searching
Finally, we used Nutch's search support (built on top of Lucene) to make these
web pages searchable. The actual indexes were stored on multiple page searchers,
since (at that time) a typical 4-core box with 8 GB of RAM could comfortably
handle 10-20 M pages, and our index was bigger than that. Nutch provided the support
to distribute a search request to multiple searchers, each with a slice (“shard”) of the
index, then combine the results.
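To make that fan-out concrete, the sketch below shows the general scatter-gather pattern: the query is sent to each shard searcher in parallel, and the per-shard hit lists are merged into a single ranked result. The ShardSearcher interface and SearchHit record are hypothetical stand-ins for illustration; they are not Nutch's actual distributed search classes.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardedSearch {

    // One hit from a single shard: document id plus relevance score.
    record SearchHit(String docId, float score) {}

    // A searcher that holds one slice ("shard") of the full index.
    interface ShardSearcher {
        List<SearchHit> search(String query, int topN) throws Exception;
    }

    // Send the query to every shard in parallel, then merge the per-shard
    // top-N lists into a single global top-N ordered by score.
    static List<SearchHit> search(List<ShardSearcher> shards, String query, int topN)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<List<SearchHit>>> futures = new ArrayList<>();
            for (ShardSearcher shard : shards) {
                futures.add(pool.submit(() -> shard.search(query, topN)));
            }
            List<SearchHit> merged = new ArrayList<>();
            for (Future<List<SearchHit>> f : futures) {
                merged.addAll(f.get());
            }
            merged.sort(Comparator.comparingDouble(SearchHit::score).reversed());
            return merged.subList(0, Math.min(topN, merged.size()));
        } finally {
            pool.shutdown();
        }
    }
}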
6.3.4 Source Code Crawling
The source code crawler was also based on Nutch. We added “protocol handlers”
for CVS and SVN, which let us leverage the distributed fetching and parsing support
that was built into Nutch.
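As a rough illustration of the protocol-handler idea, the fetch loop looks up a handler by URL scheme, so adding CVS and SVN support amounts to registering handlers for those schemes. The interface and registry below are hypothetical sketches, not Nutch's plugin API.

import java.util.HashMap;
import java.util.Map;

public class ProtocolHandlers {

    // Fetched bytes plus a content type, analogous to a fetched web page.
    record Content(String url, byte[] data, String contentType) {}

    // One handler per URL scheme (http, svn, CVS pserver, ...).
    interface ProtocolHandler {
        Content fetch(String url) throws Exception;
    }

    private final Map<String, ProtocolHandler> handlersByScheme = new HashMap<>();

    void register(String scheme, ProtocolHandler handler) {
        handlersByScheme.put(scheme, handler);
    }

    // The fetcher calls this for every URL in its fetch list; the same loop
    // then works unchanged whether the URL points at a web page or a repository.
    Content fetch(String url) throws Exception {
        String scheme = url.substring(0, url.indexOf(':'));
        ProtocolHandler handler = handlersByScheme.get(scheme);
        if (handler == null) {
            throw new IllegalArgumentException("No handler registered for scheme: " + scheme);
        }
        return handler.fetch(url);
    }
}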
The “crawlDB” for the source code crawler contained HTTP-based URLs to
SVN and CVS repositories. Many of these URLs were entered by hand after manual
searching, but we also included URLs that were discovered during the web page
crawl, as described previously. Finally, the project processing code
(see below) also provided us with repository information.
One of the challenges we ran into was deciding whether to only get the trunk of a
project's code, or some number of the tags and branches as well. Initially we just
went after the trunk, but eventually we settled on logic that would fetch the trunk,
plus the “most interesting” tags, which we defined as being the latest point release
for each major and minor version. Thus we'd go for the 1.0.4, 2.0 and 2.1.3 tags, but
not 1.0.1, 1.0.2, 1.0.3, 2.1.0, etc.
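A small sketch of that tag-selection rule follows, assuming tag names are plain dotted version strings; real repositories need messier pattern matching (e.g. "release-1_0_4" or "v2.1.3").

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InterestingTags {

    private record Candidate(int patch, String tag) {}

    // Keep only the latest point release for each major.minor pair.
    static List<String> selectInterestingTags(Collection<String> tags) {
        Map<String, Candidate> latestPerMinor = new TreeMap<>();  // "1.0" -> best tag
        for (String tag : tags) {
            String[] parts = tag.split("\\.");
            if (parts.length < 2) continue;                 // skip unparseable tags
            String minorKey = parts[0] + "." + parts[1];
            int patch = parts.length > 2 ? Integer.parseInt(parts[2]) : 0;
            Candidate best = latestPerMinor.get(minorKey);
            if (best == null || patch > best.patch()) {
                latestPerMinor.put(minorKey, new Candidate(patch, tag));
            }
        }
        List<String> selected = new ArrayList<>();
        for (Candidate best : latestPerMinor.values()) {
            selected.add(best.tag());
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> tags = List.of("1.0.1", "1.0.2", "1.0.3", "1.0.4",
                                    "2.0", "2.1.0", "2.1.3");
        // Prints [1.0.4, 2.0, 2.1.3], matching the example in the text.
        System.out.println(selectInterestingTags(tags));
    }
}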
We also found out quickly that we needed to be careful about monitoring and
constraining the load that we put on CVS and SVN repositories. Due to a bug in
the code, we accidentally wound up trying to download the complete Apache.org
SVN repository—the trunk and all branches and tags from every project. This was
crushing their infrastructure, and the ops team at Apache wisely blocked our crawler
IP addresses to prevent a meltdown. After some groveling and negotiations, plus
more unit tests, we were unblocked and could resume the crawl at a more reasonable
rate.
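A minimal sketch of the kind of per-host throttling this implies, assuming a fixed minimum delay between successive requests to the same repository host; the class and its delay parameter are illustrative, not the crawler's actual policy.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HostThrottle {

    // Per-host state: timestamp of the most recent request to that host.
    private static final class HostState {
        long lastRequestMillis;
    }

    private final long minDelayMillis;
    private final Map<String, HostState> hosts = new ConcurrentHashMap<>();

    HostThrottle(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Call before each fetch. Holding the per-host lock while sleeping
    // serializes requests to the same host and spaces them out by at
    // least minDelayMillis; requests to other hosts are unaffected.
    void acquire(String host) throws InterruptedException {
        HostState state = hosts.computeIfAbsent(host, h -> new HostState());
        synchronized (state) {
            long wait = state.lastRequestMillis + minDelayMillis - System.currentTimeMillis();
            if (wait > 0) {
                Thread.sleep(wait);
            }
            state.lastRequestMillis = System.currentTimeMillis();
        }
    }
}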
For some of the larger repositories, we looked into mirroring them, and eventually
did set up an rsync of several. Unfortunately, we were never able to negotiate a
mirroring agreement with SourceForge, which was the biggest single repository that
we needed to crawl. And the total number of unique repositories (more than 100)