able to build the project in order to parse it, and that in turn means you wind up
with a system that requires constant care and feeding to keep it running. Often that
doesn't happen, so the end result is shelfware.
Early on we made a key decision: we had to be able to process files individually,
without knowledge of build settings, compiler versions, etc. We also had
to handle a wide range of languages. This in turn meant that the type of parsing we
could do was constrained by what features we could extract from a very fuzzy parse.
We couldn't build a symbol table, for example, as that would require processing all
of the includes/imports.
Depending on the language, the level of single-file parsing varies widely. Python,
Java and C# are examples of languages where you can generate a good parse tree,
while C/C++ are at the other end of the spectrum. Languages such as C/C++ that
support macros and conditional compilation are especially challenging. Dynamic
languages like Ruby and Perl create their own unique problems, as the meaning of
a term (is it a variable or a function?) sometimes isn't determined until run-time.
So what we wind up with is a best guess, where we're right most of the time but
we'll occasionally get it wrong.
We use ANTLR to handle most of our parsing needs. Terence Parr, the author of
ANTLR, added memoization support in version 3.0, which allowed us to use
fairly flexible lexer rules without paying a huge performance penalty for frequent
backtracking.
6.5.5 Substring Searching
The second important thing we learned from our beta testing was that we had to
support some form of substring searching. For example, when a user searches on
“ldap” she expects to find documents containing terms like “getLDAPConfig”,
“ldapTimeout”, and “find_details_with_ldap”.
We could treat every search term as if it had implicit wildcards, like “*ldap*”,
but that is both noisy and slow. The noise (false positive hits) comes from treating
all contiguous runs of characters as potential matches, so a search for “heap” finds
a term like “theAPI” .
The performance hit comes from having to: (a) first enumerate all terms in the
index to find any that contain <term> as a substring, and then (b) use the resulting
set of matching terms in a (potentially very large) OR query. BooleanQuery allows a
maximum of 1,024 clauses by default—searching on the Lucene mailing list shows
many people have encountered this limit while trying to support wildcard queries.
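To make that cost concrete, here is a minimal sketch of the naive approach, written against recent Lucene APIs (the MultiTerms and BooleanQuery.Builder classes from Lucene 8.x); the field name "contents", the class name, and the helper name are illustrative assumptions, not the implementation described in this chapter.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

public class NaiveSubstringSearch {
    // Step (a): enumerate every term in the field and keep the ones that
    // contain the substring. Step (b): OR the survivors together.
    // Correct, but it scans the entire term dictionary, and the add() calls
    // throw BooleanQuery.TooManyClauses once the default limit of 1,024
    // clauses is exceeded.
    public static Query naiveSubstringQuery(IndexReader reader, String substring)
            throws IOException {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        Terms terms = MultiTerms.getTerms(reader, "contents");
        if (terms != null) {
            TermsEnum te = terms.iterator();
            for (BytesRef term = te.next(); term != null; term = te.next()) {
                if (term.utf8ToString().contains(substring)) {
                    // deepCopyOf because the TermsEnum reuses its BytesRef
                    builder.add(new TermQuery(new Term("contents", BytesRef.deepCopyOf(term))),
                                BooleanClause.Occur.SHOULD);
                }
            }
        }
        return builder.build();
    }
}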
There are a number of approaches to solving the wildcard search problem, some
of which are covered in the “Lucene in Action” book. One is to take every term and
index it using all of its suffix substrings. For example, “myLDAP” gets indexed as
“myldap”, “yldap”, “ldap”, and so on. This then lets you turn a search for “*ldap*”
into “ldap*”, which cuts down on the term enumeration time: you can do a binary
search for terms starting with “ldap”, rather than enumerating every term in the index.
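As a rough illustration of the suffix-substring idea (a sketch under assumptions, not the actual analyzer), the indexing side can emit every lowercased suffix of each token, and the query side can then rewrite a “*ldap*” request into Lucene's PrefixQuery. In a real analyzer the suffix generation would live inside a TokenFilter; the class and method names below are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class SuffixIndexingSketch {
    // Index-time side: "myLDAP" -> ["myldap", "yldap", "ldap", "dap", "ap", "p"].
    public static List<String> suffixes(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lower.length(); i++) {
            out.add(lower.substring(i));
        }
        return out;
    }

    // Query-time side: with every suffix indexed, a "*ldap*" search becomes a
    // prefix query, which Lucene answers against the sorted term dictionary
    // instead of scanning and OR-ing every matching term.
    public static Query substringAsPrefix(String field, String substring) {
        return new PrefixQuery(new Term(field, substring.toLowerCase(Locale.ROOT)));
    }
}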