able to build the project in order to parse it, and that in turn means you wind up
with a system that requires constant care and feeding to keep it running. Often that
doesn't happen, so the end result is shelfware.
Early on we made a key decision: we had to be able to process files individually,
without knowledge of build settings, compiler versions, etc. We also had
to handle a wide range of languages. This in turn meant that the type of parsing we
could do was constrained by what features we could extract from a very fuzzy parse.
We couldn't build a symbol table, for example, as that would require processing all
of the includes/imports.
Depending on the language, the level of single-file parsing varies widely. Python,
Java and C# are examples of languages where you can generate a good parse tree,
while C/C++ are at the other end of the spectrum. Languages such as C/C++ that
support macros and conditional compilation are especially challenging. Dynamic
languages like Ruby and Perl create their own unique problems, as the meaning of
a term (is it a variable or a function?) sometimes isn't determined until run-time.
So what we wind up with is a best guess, where we're right most of the time but
we'll occasionally get it wrong.
We use ANTLR to handle most of our parsing needs. Terence Parr, the author of
ANTLR, added memoization support in version 3.0, which allowed us to use
fairly flexible lexer rules without paying a huge performance penalty for frequent
backtracking.
6.5.5 Substring Searching
The second important thing we learned from our beta testing was that we had to
support some form of substring searching. For example, when a user searches on
“ldap” she expects to find documents containing terms like “getLDAPConfig”,
“ldapTimeout”, and “find_details_with_ldap”.
We could treat every search term as if it had implicit wildcards, like “*ldap*”,
but that is both noisy and slow. The noise (false positive hits) comes from treating
all contiguous runs of characters as potential matches, so a search for “heap” finds
a term like “theAPI” .
The performance hit comes from having to: (a) first enumerate all terms in the
index to find any that contain <term> as a substring, and then (b) use the resulting
set of matching terms in a (potentially very large) OR query. BooleanQuery allows a
maximum of 1,024 clauses by default—searching on the Lucene mailing list shows
many people have encountered this limit while trying to support wildcard queries.
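To make that cost concrete, here is a minimal sketch of the naive approach, written against recent Lucene APIs (the MultiTerms and BooleanQuery.Builder classes from Lucene 8.x); the field name "contents", the class name, and the helper name are illustrative assumptions, not the implementation described in this chapter.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

public class NaiveSubstringSearch {
    // Step (a): enumerate every term in the field and keep the ones that
    // contain the substring. Step (b): OR the survivors together.
    // Correct, but it scans the entire term dictionary, and the add() calls
    // throw BooleanQuery.TooManyClauses once the default limit of 1,024
    // clauses is exceeded.
    public static Query naiveSubstringQuery(IndexReader reader, String substring)
            throws IOException {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        Terms terms = MultiTerms.getTerms(reader, "contents");
        if (terms != null) {
            TermsEnum te = terms.iterator();
            for (BytesRef term = te.next(); term != null; term = te.next()) {
                if (term.utf8ToString().contains(substring)) {
                    // deepCopyOf because the TermsEnum reuses its BytesRef
                    builder.add(new TermQuery(new Term("contents", BytesRef.deepCopyOf(term))),
                                BooleanClause.Occur.SHOULD);
                }
            }
        }
        return builder.build();
    }
}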
There are a number of approaches to solving the wildcard search problem, some
of which are covered in the “Lucene in Action” book. One is to take every term and
index it using all of its suffix substrings. For example, “myLDAP” gets indexed as
“myldap”, “yldap”, “ldap”, and so on. This then lets you turn a search for “*ldap*”
into “ldap*”, which cuts down on the term enumeration time: you can do a binary
search for terms starting with “ldap”, rather than enumerating every term in the index.
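As a rough illustration of the suffix-substring idea (a sketch under assumptions, not the actual analyzer), the indexing side can emit every lowercased suffix of each token, and the query side can then rewrite a “*ldap*” request into Lucene's PrefixQuery. In a real analyzer the suffix generation would live inside a TokenFilter; the class and method names below are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class SuffixIndexingSketch {
    // Index-time side: "myLDAP" -> ["myldap", "yldap", "ldap", "dap", "ap", "p"].
    public static List<String> suffixes(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lower.length(); i++) {
            out.add(lower.substring(i));
        }
        return out;
    }

    // Query-time side: with every suffix indexed, a "*ldap*" search becomes a
    // prefix query, which Lucene answers against the sorted term dictionary
    // instead of scanning and OR-ing every matching term.
    public static Query substringAsPrefix(String field, String substring) {
        return new PrefixQuery(new Term(field, substring.toLowerCase(Locale.ROOT)));
    }
}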