Krugle Code Search Architecture - Finding Source Code on the Web for Remix and Reuse

Enterprise system and the public site would have been a top-level “master” that

handles distributing search requests and combining the results.

We would still have needed a back-end system to crawl web pages and

handle search requests, but that's a much more isolated problem, and one with

better existing support from Nutch.

6.5 Krugle Enterprise

The architecture described above worked well for handling billions of lines of code,

but it wasn't suitable for a stand-alone enterprise product that could run reliably

without daily care and feeding. In addition, we didn't have the commit comment

data from the SCM systems that hosted the project source code, which was a highly

valuable source of information for both searches and analytics.

So we created a workflow system (internally called “the Hub”) that handled the

crawling and processing of data, and converted the original multi-server search

system into a single-server solution (“the API”).

The enterprise version doesn't support crawling web pages, and it relies on users

manually defining projects, specifying the repository type and location, the

description of the project, etc. This information is still stored in a MySQL database.
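
As a rough illustration of what a manually defined project might carry, the sketch below models one such record and checks it before it would be saved. The class name, field names and validation rules are assumptions made for this example; the text does not describe the actual MySQL schema.

```python
from dataclasses import dataclass
from typing import List

# Repository types named in this chapter; the real product may accept others.
KNOWN_REPO_TYPES = {"clearcase", "perforce", "starteam", "git"}


@dataclass(frozen=True)
class ProjectDefinition:
    """Hypothetical record for a user-defined project (not the Krugle schema)."""
    name: str
    repo_type: str       # which SCM system hosts the code
    repo_location: str   # URL or path understood by that SCM system
    description: str = ""

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means acceptable."""
        problems = []
        if not self.name.strip():
            problems.append("name is required")
        if self.repo_type.lower() not in KNOWN_REPO_TYPES:
            problems.append("unknown repository type: %r" % self.repo_type)
        if not self.repo_location.strip():
            problems.append("repository location is required")
        return problems
```

Returning a list of problems rather than raising on the first one lets an administrative form show every error at once.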

6.5.1 SCM Comments

We added support for fetching, parsing and searching SCM comments that we

retrieved from SCM systems. These comments were stored in the same Solr search

server used for project search, but in a different Solr “core.”
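
In a multi-core Solr server, each core is addressed by its own path segment under the same base URL, so a single server can answer both kinds of query. The sketch below builds request URLs for two cores. The host, port and the core names `projects` and `scm_comments` are assumptions for this example, not the names used in the product.

```python
from urllib.parse import urlencode

# Assumed location of the single Solr server.
SOLR_BASE = "http://localhost:8983/solr"


def select_url(core: str, query: str, rows: int = 10) -> str:
    """Build a Solr select URL that targets one named core."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return "%s/%s/select?%s" % (SOLR_BASE, core, params)


# Same server, different cores (core names are hypothetical).
project_search = select_url("projects", "http client")
comment_search = select_url("scm_comments", "fix null pointer")
```

Because the core is only a path segment, the two indexes can have entirely different schemas while sharing one server process.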

We created a Solr index schema that had the following fields, among others

(Table 6.2):

6.5.2 SCMI Architecture

Early on we realized that it would be impossible to install and run all of the many

different types of source code management system (SCM) clients on the enterprise

server. For example, a ClearCase SCM requires a matching client, which in turn has

to be custom installed.

Our solution was to define a standard protocol between the Hub and “helper”

applications that could run on other servers. This SCM interface (SCMI) let us quickly

build connectors to many different SCM systems, including ClearCase, Perforce,

StarTeam, and git, as well as non-SCM sources of information such as Jira and

Bugzilla.
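
The text does not spell out the SCMI protocol itself, so the following is only a sketch of the general idea: the Hub talks to every source through one small contract, and each helper implements that contract for a single system. All names here (`Connector`, `Comment`, `register`, `connector_for`, the `"fake"` type) are hypothetical, and the sketch runs in one process rather than across servers.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, List, Type


@dataclass(frozen=True)
class Comment:
    """One commit comment as the Hub might receive it (illustrative fields)."""
    revision: str
    author: str
    message: str


class Connector(ABC):
    """Hypothetical contract that every helper application would fulfil."""

    @abstractmethod
    def fetch_comments(self, location: str) -> List[Comment]:
        """Return the comments for the repository at the given location."""


_REGISTRY: Dict[str, Type[Connector]] = {}


def register(scm_type: str) -> Callable[[Type[Connector]], Type[Connector]]:
    """Class decorator that makes a connector available under a type name."""
    def wrap(cls: Type[Connector]) -> Type[Connector]:
        _REGISTRY[scm_type] = cls
        return cls
    return wrap


@register("fake")
class FakeConnector(Connector):
    """Stand-in for a real helper; returns canned data for demonstration."""

    def fetch_comments(self, location: str) -> List[Comment]:
        return [Comment("1", "alice", "initial import of " + location)]


def connector_for(scm_type: str) -> Connector:
    """Look up and instantiate the connector for a repository type."""
    try:
        return _REGISTRY[scm_type]()
    except KeyError:
        raise ValueError("no connector registered for %r" % scm_type) from None
```

With a contract like this, supporting another system means writing one new connector class; the code that consumes comments does not change.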
