of the user population will eventually welcome tools that understand a lot more than
present day keyword search does. Better understanding and increased search power
depend on better parameterization of text content in a search engine index. The most
universal storage employed today to capture text content is an inverted index. In
a typical Web search engine, an inverted index may register the presence or frequency
of keywords, along with font size or style, and relative location in a Web page.
This model is only a rough approximation of the complexity of human
language and has the potential to be superseded by future generations of indexing
standards.
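
As an illustration of the keyword model described above, the following is a minimal Python sketch of an inverted index that records keyword presence, frequency and position. The function names and the two sample sentences are assumptions made for this example (they are not taken from InFact or from any particular search engine), and attributes such as font size or style are left out.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to {doc_id: [positions]}; frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    return index


def keyword_search(index, *keywords):
    """Return the ids of documents that contain every keyword (boolean AND)."""
    postings = [set(index.get(k.lower(), {})) for k in keywords]
    return set.intersection(*postings) if postings else set()


# Hypothetical two-document collection used only for illustration.
docs = {
    1: "China sells weapons to Iran",
    2: "Iran sells oil to China",
}
index = build_inverted_index(docs)
both = keyword_search(index, "china", "sells", "iran")
```

In this sketch both documents match the query, because the index knows which words occur and where, but not which entity is the seller and which is the buyer. That is the sense in which a keyword index only roughly approximates the content of the text.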
InFact relies on a new approach to text parameterization that captures many
linguistic attributes ignored by standard inverted indices. Examples are syntactic
categories (parts of speech), syntactic roles (such as subjects, objects, verbs,
prepositional constraints, modifiers, etc.) and semantic categories (such as people, places,
monetary amounts, etc.). Correspondingly, at query time, there are explicit or im-
plicit search operators that can match, join or filter results based on this rich as-
sortment of tags to satisfy very precise search requirements.
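
To make the idea of tag-based operators concrete, here is a hedged Python sketch in which every index entry carries a syntactic category, a syntactic role and a semantic category, and two small operators filter and join on those tags. The `Posting` record, the `match` and `join_docs` functions and the sample entries are hypothetical names chosen for this example; they do not describe InFact's actual schema or query operators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    doc_id: int
    term: str       # normalized word form
    pos: str        # syntactic category, e.g. "NOUN" or "VERB"
    role: str       # syntactic role, e.g. "subject", "verb", "object"
    category: str   # semantic category, e.g. "country", or "" if none


# Hypothetical tagged entries for two one-clause documents.
POSTINGS = [
    Posting(1, "china", "NOUN", "subject", "country"),
    Posting(1, "sell", "VERB", "verb", ""),
    Posting(1, "weapon", "NOUN", "object", ""),
    Posting(2, "iran", "NOUN", "subject", "country"),
    Posting(2, "sell", "VERB", "verb", ""),
    Posting(2, "oil", "NOUN", "object", ""),
]


def match(postings, **constraints):
    """Filter operator: keep postings whose fields equal every constraint."""
    return [
        p for p in postings
        if all(getattr(p, field) == value for field, value in constraints.items())
    ]


def join_docs(*result_sets):
    """Join operator: ids of documents that appear in every result set."""
    ids = [{p.doc_id for p in rs} for rs in result_sets]
    return set.intersection(*ids) if ids else set()


# "Documents in which China is the subject of the verb 'sell'".
china_as_subject = match(POSTINGS, term="china", role="subject")
selling = match(POSTINGS, term="sell", role="verb")
hits = join_docs(china_as_subject, selling)

# "Any country acting as a subject", using the semantic category tag.
country_subjects = [p.term for p in match(POSTINGS, role="subject", category="country")]
```

For simplicity the join here works at document level; a fuller version would presumably join on a clause or sentence identifier so that the subject and the verb are required to belong to the same clause.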
The goal of our experiment was to demonstrate that, once scalability barriers
are overcome, a statistically significant percentage of Web users can be converted
from keyword search to natural language based search. InFact has been the search engine
behind the GlobalSecurity.org site (www.globalsecurity.org) for the past six months.
According to the Alexa site (www.alexa.com), GlobalSecurity.org has a respectable
overall traffic rank (no. 6,751 as of Feb 14, 2006). Users of the site can perform
keyword searches, navigate results by action themes, or enter explicit semantic queries.
An analysis of query logs demonstrates that all these non-standard information
discovery processes based on NLP have become increasingly popular over the first six
months of operation.
The remainder of this chapter is organized as follows. Section 5.2 presents an
overview of our system, with special emphasis on the linguistic analyses and new
search logic. Section 5.3 describes the architecture and deployment of a typical
InFact system. Section 5.4 is a study of user patterns and site statistics.
5.2 InFact System Overview
InFact consists of an indexing and a search module. With reference to Figure 5.1, in-
dexing pertains to the processing flow on the bottom of the diagram. InFact models
text as a complex multivariate object using a unique combination of deep parsing,
linguistic normalization and efficient storage. The storage schema addresses the
fundamental difficulty of reducing information contained in parse trees into
generalized data structures that can be queried dynamically. In addition, InFact handles
the problem of linguistic variation by mapping complex linguistic structures into se-
mantic and syntactic equivalents. This representation supports dynamic relationship
and event search, information extraction and pattern matching from large document
collections in real time.
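
One way to picture the mapping of linguistic variation onto a common form is the toy Python sketch below, which reduces an active clause and its passive counterpart to the same flat actor-action-target record. The `Clause` and `Triple` types and the `normalize` function are assumptions introduced for this illustration; the clause fields are filled in by hand where a real system would take them from a parser, and the sketch covers voice alternation only.

```python
from typing import NamedTuple


class Clause(NamedTuple):
    voice: str     # "active" or "passive"
    subject: str   # grammatical subject
    verb: str      # verb lemma
    obj: str       # direct object (active) or agent of the "by" phrase (passive)


class Triple(NamedTuple):
    actor: str
    action: str
    target: str


def normalize(clause):
    """Map active and passive clauses to one actor-action-target triple."""
    if clause.voice == "passive":
        # In the passive, the grammatical subject is the logical target.
        return Triple(actor=clause.obj, action=clause.verb, target=clause.subject)
    return Triple(actor=clause.subject, action=clause.verb, target=clause.obj)


active = Clause("active", "china", "sell", "weapon")     # "China sold weapons"
passive = Clause("passive", "weapon", "sell", "china")   # "Weapons were sold by China"

same_record = normalize(active) == normalize(passive)
```

Because both clauses reduce to the same record, a single query on the actor and action fields would retrieve either phrasing, which is the kind of relationship search that the flat representation is meant to support.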
5.2.1 Indexing
With reference to Figure 5.1, InFact's Indexing Service performs in order: 1) docu-
ment processing, 2) clause processing, and 3) linguistic normalization.