but if a hit is missing then this is viewed as a bug. For example, when one of our
users searches for all callers of a particular function in their company's source
code, they typically don't know about every single source file where that API is used
(otherwise they wouldn't need us), but they certainly do know of many files which
should be part of the result set. And if one of those “well-known hits” is missing,
then we've got big problems.
So where did we run into this situation? When files are very large, the default
setting for Nutch was to process only the first 10,000 terms. This is generally fine
for web pages, but completely fails the query test when dealing with source code. Hell
hath no fury like a developer who doesn't find a large file they know should be a hit,
because the search term only exists near the end.
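The effect of a term limit can be illustrated with a small sketch. It uses a toy regular-expression tokenizer rather than Nutch's or Lucene's actual analysis chain, and the names `index_terms`, `max_terms`, and `call_rare_api` are hypothetical; only the 10,000-term limit comes from the text above.

```python
import re

def index_terms(text, max_terms=10_000):
    """Tokenize text and keep only the first max_terms tokens (None = no limit)."""
    tokens = re.findall(r"\w+", text.lower())
    if max_terms is not None:
        tokens = tokens[:max_terms]  # everything past the limit is never indexed
    return set(tokens)

# A synthetic "large source file": 20,000 filler tokens, then one rare identifier.
large_file = "int x ; " * 10_000 + "call_rare_api ( ) ;"

truncated = index_terms(large_file)               # default limit of 10,000 terms
full = index_terms(large_file, max_terms=None)    # no limit

print("call_rare_api" in truncated)  # the identifier near the end is lost
print("call_rare_api" in full)       # the identifier is searchable
```

With the limit in place the identifier at the end of the file never reaches the index, so no query for it can return the file.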
Another example is where we misclassified a file, for example, treating file xxx.h
as a C++ header when it was actually a C header. When the user filters search results
by programming language, this can exclude files that they know of and are expecting
to see in the result set.
There wasn't a silver bullet for this problem, but we did manage to catch a lot
of problems once we figured out ways to feed our data back on itself. For example,
we'd take a large, random selection of source files from the http://www.krugle.org
site, and generate a list of all possible multi-line (“code snippet”) searches in a
variety of sizes (e.g., 1 to 10 lines). We'd then verify that for every one of these
code snippets, the search returned the original source document as a hit.
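The self-check described above can be sketched roughly as follows. This is a minimal illustration, not the actual Krugle test harness: `snippets`, `missing_hits`, the tiny in-memory `corpus`, and the two stand-in search functions are all hypothetical, and a real run would call the search system under test instead.

```python
def snippets(lines, min_len=1, max_len=10):
    """Yield every run of consecutive lines whose length is in [min_len, max_len]."""
    for size in range(min_len, max_len + 1):
        for start in range(len(lines) - size + 1):
            yield "\n".join(lines[start:start + size])

def missing_hits(corpus, search, min_len=1, max_len=10):
    """Return (path, snippet) pairs where search(snippet) did not include path."""
    failures = []
    for path, text in corpus.items():
        for snippet in snippets(text.splitlines(), min_len, max_len):
            if path not in search(snippet):
                failures.append((path, snippet))
    return failures

# Hypothetical stand-in corpus: path -> file contents.
corpus = {
    "a.c": "int main(void) {\n    return helper();\n}",
    "b.c": "int helper(void) {\n    return 42;\n}",
}

def naive_search(query):
    # Stand-in for a correct search: exact substring match over full files.
    return {path for path, text in corpus.items() if query in text}

def truncating_search(query):
    # Stand-in for a faulty index that only saw the first two lines of each file.
    return {path for path, text in corpus.items()
            if query in "\n".join(text.splitlines()[:2])}

print(missing_hits(corpus, naive_search))            # no failures expected
print(len(missing_hits(corpus, truncating_search)))  # snippets touching line 3 fail
```

Every failure pairs a file with a snippet taken from that file, so each one is a missing "well-known hit" by construction.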
6.5.7 Post Mortem
The Krugle enterprise search system is in active use today at a mixture of Fortune
100 and mid-size technology companies. The major benefits seen by customers are:
(a) increased code re-use, primarily at the project level; and (b) a decrease in time
spent fixing the same piece of code that exists in multiple projects.
A major challenge has been to provide potential customers with a way to quantify
the expected benefits. There's a general perception that search is important; for
example, it's easy to agree with statements like “If you can't find it, you can't
fix it.” It's difficult, though, to determine how much time and money such a system
would save, and thus whether investing in a Krugle system is justified.
One additional and unexpected hurdle has been integration of Krugle systems
into existing infrastructure, primarily for authentication and authorization. Many
large enterprise customers are very sensitive about who can access source code.
Even between groups in the same company there can be a lack of trust, so providing
enterprise-wide access control that all parties accept often leads to protracted
engagements with significant professional services overhead.
Acknowledgements Portions of this chapter were adapted from a case study written by the author
that was previously published in the book “Lucene In Action, 2nd Edition” by Michael
McCandless, Erik Hatcher, and Otis Gospodnetic.