Krugle Code Search Architecture - Finding Source Code on the Web for Remix and Reuse

but if a hit is missing then this is viewed as a bug. For example, when one of our

users searches for all callers of a particular function call in their company's source

code, they typically don't know about every single source file where that API is used

(otherwise they wouldn't need us), but they certainly do know of many files which

should be part of the result set. And if that “well known hit” is missing, then we've

got big problems.

So where did we run into this situation? When files are very large, the default

setting for Nutch was to only process the first 10,000 terms. This in general is OK for

web pages, but completely fails the query test when dealing with source code. Hell

hath no fury like a developer who doesn't find a large file they know should be a hit,

because the search term only exists near the end.
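
The effect of such a term cap can be illustrated with a small sketch. This is not Nutch's code: the whitespace tokenizer, the index_terms helper, and the sample file are assumptions made purely for illustration, with the cap set to the 10,000-term figure mentioned above.

```python
MAX_TERMS = 10_000  # cap discussed above; all names in this sketch are hypothetical


def index_terms(text, max_terms=None):
    """Return the set of terms an indexer would see, optionally truncated."""
    terms = text.split()  # naive whitespace tokenizer, for illustration only
    if max_terms is not None:
        terms = terms[:max_terms]  # everything past the cap is silently dropped
    return set(terms)


# A large "source file": 12,000 filler terms, then one call near the end.
large_file = " ".join(["filler"] * 12_000 + ["rarely_used_api()"])

truncated = index_terms(large_file, MAX_TERMS)
complete = index_terms(large_file)

found_when_truncated = "rarely_used_api()" in truncated
found_when_complete = "rarely_used_api()" in complete
```

With the cap in place the call near the end of the file never reaches the index, so no query can match it, however well the rest of the system behaves.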

Another case is where we misclassified a file, such as getting wrong whether file xxx.h was

a C++ header or a C header. When the user filters search results by programming

language, this can exclude files that they know of and are expecting to see in the

result set.
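
A minimal sketch of how a wrong language label hides a file behind a filter is shown here. The content-based heuristic, the sample headers, and every name in it are assumptions for illustration; it is not the classifier Krugle used, and the list of C++-only constructs is deliberately incomplete.

```python
import re

# Constructs that are valid C++ but not C; a small, incomplete list (assumption).
CPP_ONLY = re.compile(r"^\s*(class\s+\w+|namespace\s+\w+|template\s*<)", re.MULTILINE)


def guess_header_language(text):
    """Guess 'C++' or 'C' for a .h file from its contents (heuristic only)."""
    return "C++" if CPP_ONLY.search(text) else "C"


# Hypothetical header files, keyed by file name.
headers = {
    "widget.h": "class Widget {\npublic:\n    void draw();\n};\n",
    "socket.h": "int open_socket(const char *host, int port);\n",
}

# Naive labelling: every .h file is assumed to be C.
by_extension = {name: "C" for name in headers}
# Content-based labelling using the heuristic above.
by_content = {name: guess_header_language(text) for name, text in headers.items()}

# Filtering on "C++" drops widget.h under the naive labelling.
cpp_naive = [name for name, lang in by_extension.items() if lang == "C++"]
cpp_content = [name for name, lang in by_content.items() if lang == "C++"]
```

Under the naive labelling, a user who filters on C++ never sees widget.h, even though the file is indexed and matches the query text.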

There wasn't a silver bullet for this problem, but we did manage to catch a lot

of problems once we figured out ways to feed our data back on itself. For example,

we'd take a large, random selection of source files from the http://www.krugle.org

site, and generate a list of all possible multi-line (“code snippet”) searches in a

variety of sizes (e.g. 1-10 lines). We'd then verify that for every one of these code

snippets, we got a hit in the original source document.
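
The snippet check described above can be sketched roughly as follows. The function names are hypothetical, and the substring-based naive_search merely stands in for the real search service, which is not shown in this chapter.

```python
def snippets(lines, min_len=1, max_len=10):
    """Yield every run of consecutive lines with length min_len..max_len."""
    for size in range(min_len, max_len + 1):
        for start in range(len(lines) - size + 1):
            yield "\n".join(lines[start:start + size])


def missing_hits(doc_id, text, search, min_len=1, max_len=10):
    """Return the snippets whose search results do not include doc_id."""
    lines = text.splitlines()
    return [s for s in snippets(lines, min_len, max_len)
            if doc_id not in search(s)]


# Stand-in corpus and search function: plain substring match (assumption).
corpus = {
    "a.c": "int main(void) {\n    return 0;\n}",
    "b.c": "void helper(void) {\n}",
}


def naive_search(query):
    """Return the ids of all documents that contain the query text."""
    return {doc for doc, text in corpus.items() if query in text}


# An empty list means every snippet of a.c found its own source document.
failures = missing_hits("a.c", corpus["a.c"], naive_search)
```

Any snippet returned by missing_hits is a concrete, reproducible query for which a "well known hit" is absent, which makes it a useful starting point for debugging the indexing pipeline.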

6.5.7 Post Mortem

The Krugle enterprise search system is in active use today at a mixture of Fortune

100 and mid-size technology companies. The major benefits seen by customers are:

(a) increased code re-use, primarily at the project level; and (b) a decrease in time

spent fixing the same piece of code that exists in multiple projects.

A major challenge has been to provide potential customers with a way to quantify

potential benefits. There's a general perception that search is important, e.g. it's easy

to agree with statements like “If you can't find it, you can't fix it.” It's difficult,

though, to determine how much time and money such a system would save, and

thus whether investing in a Krugle system is justified.

One additional and unexpected hurdle has been integration of Krugle systems

into existing infrastructure, primarily for authentication and authorization. Many

large enterprise customers are very sensitive about who can access source code,

and even between groups in the same company a lack of trust means that providing

enterprise-wide access control that all parties accept often leads to protracted

engagements with significant professional services overhead.

Acknowledgements Portions of this chapter were adapted from a case study written by the author

that was previously published in the book “Lucene in Action, 2nd Edition” by Michael McCandless,

Erik Hatcher, and Otis Gospodnetic.
