6.5.3 Performance
For an enterprise search appliance, a basic issue is how to do two things well at
the same time—updating a live index, and handling search requests. Both tasks can
require extensive CPU, disk and memory resources, so it's easy to wind up with
resource contention issues that kill your performance.
We made three decisions that helped us avoid these problems. First, we pushed a significant amount of work “off the box” by putting a lot of the heavy lifting into the hands of small clients called Source Code Management Interfaces (SCMIs). These run on external customer servers instead of on our appliance, and act as collectors for information about projects, SCM comments, source code, and other development-oriented information. The information is then partially digested before being sent back to the appliance via a typical RESTful HTTP protocol.
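To make that data flow concrete, here is a minimal sketch of the kind of push an SCMI-style client might perform. The class name, endpoint path, and JSON fields are illustrative assumptions rather than the actual SCMI protocol; the point is only that collection and partial digestion happen on the customer's server, and a compact payload is then posted to the appliance over HTTP.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical SCMI-style client: collects and partially digests SCM data
// on the customer's server, then POSTs it to the appliance's REST endpoint.
public class ScmiPushClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final URI endpoint;

    public ScmiPushClient(String applianceUrl) {
        // Assumed endpoint path; the real appliance API may differ.
        this.endpoint = URI.create(applianceUrl + "/api/scm/commits");
    }

    public int pushCommit(String project, String revision, String comment) throws Exception {
        // Partially digested payload: only the fields the appliance needs to index.
        // A real client would use a proper JSON serializer with escaping.
        String json = String.format(
            "{\"project\":\"%s\",\"revision\":\"%s\",\"comment\":\"%s\"}",
            project, revision, comment);

        HttpRequest request = HttpRequest.newBuilder(endpoint)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();

        HttpResponse<String> response =
            http.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}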
Second, we use separate JVMs for the data processing/indexing tasks versus the searching/browsing tasks. This lets us better control memory usage, at the cost of
some wasted memory. The Hub data processing JVM receives data from the SCMI
clients, manages the workflow for parsing/indexing/analyzing the results, and builds
a new “snapshot.” This snapshot is a combination of multiple Lucene indexes, plus
all of the content and other analysis results. When a new snapshot is ready, a “flip”
request is sent to the API JVM that handles the search side of things, and this new
snapshot is gracefully swapped in.
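The flip itself amounts to opening the new snapshot's Lucene index and atomically swapping which searcher serves incoming queries. The sketch below is an assumed, simplified version of that handoff (the class and method names are hypothetical); a production implementation would also reference-count readers so that in-flight searches finish against the old snapshot before it is closed.

import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// Hypothetical sketch of the search-side "flip": the API JVM swaps in a
// searcher over the new snapshot's Lucene index and retires the old one.
public class SnapshotFlipper {
    private final AtomicReference<IndexSearcher> current = new AtomicReference<>();

    // Called when the Hub JVM signals that a new snapshot is ready.
    public void flip(Path newSnapshotIndexDir) throws Exception {
        DirectoryReader newReader =
            DirectoryReader.open(FSDirectory.open(newSnapshotIndexDir));
        IndexSearcher newSearcher = new IndexSearcher(newReader);

        IndexSearcher old = current.getAndSet(newSearcher);
        if (old != null) {
            // Simplified: a real implementation would wait for in-flight
            // searches (e.g. via IndexReader incRef/decRef) before closing.
            old.getIndexReader().close();
        }
    }

    // Search requests always run against whichever snapshot is currently live.
    public IndexSearcher searcher() {
        return current.get();
    }
}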
On a typical appliance, we have two 32-bit JVMs running, each with 1.5 GB of
memory. Another advantage of this approach is that we can shut down and restart each JVM separately, which makes it easier to do live upgrades and to debug problems.