A Case Study in Natural Language Based Web Search - Natural Language Processing and Text Mining

of the user population will eventually welcome tools that understand a lot more than

present day keyword search does. Better understanding and increased search power

depend on better parameterization of text content in a search engine index. The most

universal storage employed today to capture text content is an inverted index. In

a typical Web search engine, an inverted index may register presence or frequency

of keywords, along with font size or style, and relative location in a Web page.

Obviously this model is only a rough approximation to the complexity of human

language and has the potential to be superseded by future generations of indexing

standards.
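
As a point of reference, the kind of inverted index described above can be sketched in a few lines of Python. This is a minimal illustration, not the index format of any particular search engine: the function names, the whitespace tokenizer, and the two sample documents are assumptions made for the example, and attributes such as font size or style are left out.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to {doc_id: [token positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization (an assumption of this sketch).
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical sample documents.
docs = {
    "d1": "the army attacked the city",
    "d2": "the city was rebuilt",
}
index = build_inverted_index(docs)
```

Here presence is the membership of a document id in a term's postings, frequency is the length of the position list, and relative location is the stored position itself. Nothing in the structure records which word is a subject or an object, which is the limitation discussed above.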

InFact relies on a new approach to text parameterization that captures many

linguistic attributes ignored by standard inverted indices. Examples are syntactic

categories (parts of speech), syntactic roles (such as subjects, objects, verbs,

prepositional constraints, modifiers, etc.) and semantic categories (such as people, places,

monetary amounts, etc.). Correspondingly, at query time, there are explicit or im-

plicit search operators that can match, join or filter results based on this rich as-

sortment of tags to satisfy very precise search requirements.
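
To make the idea of tag-based matching and filtering concrete, the following Python sketch attaches a syntactic category, a syntactic role and a semantic category to each indexed token and filters on any combination of them. It is only an illustration under stated assumptions: the `TaggedToken` record, the `match` helper, the tag values and the sample tokens are all hypothetical and do not reproduce InFact's actual schema or operators.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedToken:
    doc_id: str
    text: str
    pos: str                        # syntactic category, e.g. "NOUN", "VERB"
    role: str                       # syntactic role, e.g. "subject", "object"
    category: Optional[str] = None  # semantic category, e.g. "person", "place"


def match(tokens, **constraints):
    """Keep tokens whose attributes equal every given constraint."""
    return [
        t for t in tokens
        if all(getattr(t, name) == value for name, value in constraints.items())
    ]


# Hypothetical, hand-tagged sample data.
tokens = [
    TaggedToken("d1", "Smith", "NOUN", "subject", "person"),
    TaggedToken("d1", "visited", "VERB", "verb"),
    TaggedToken("d1", "Baghdad", "NOUN", "object", "place"),
    TaggedToken("d2", "Baghdad", "NOUN", "subject", "place"),
]
```

A call such as `match(tokens, category="place", role="object")` keeps only the occurrence of a place in object position, a distinction that a keyword-only index cannot express.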

The goal of our experiment was to demonstrate that, once scalability barriers

are overcome, a statistically significant percentage of Web users can be converted

from keyword search to natural language based search. InFact has been the search engine

behind the GlobalSecurity.org site (www.globalsecurity.org) for the past six months.

According to the Alexa site (www.alexa.com), GlobalSecurity.org has a respectable

overall traffic rank (no. 6,751 as of Feb 14, 2006). Users of the site can perform

keyword searches, navigate results by action themes, or enter explicit semantic queries.

An analysis of query logs demonstrates that all these non-standard information

discovery processes based on NLP have become increasingly popular over the first six

months of operation.

The remainder of this chapter is organized as follows. Section 5.2 presents an

overview of our system, with special emphasis on the linguistic analyses and new

search logic. Section 5.3 describes the architecture and deployment of a typical

InFact system. Section 5.4 is a study of user patterns and site statistics.

5.2 InFact System Overview

InFact consists of an indexing and a search module. With reference to Figure 5.1, in-

dexing pertains to the processing flow on the bottom of the diagram. InFact models

text as a complex multivariate object using a unique combination of deep parsing,

linguistic normalization and efficient storage. The storage schema addresses the

fundamental difficulty of reducing information contained in parse trees into

generalized data structures that can be queried dynamically. In addition, InFact handles

the problem of linguistic variation by mapping complex linguistic structures into se-

mantic and syntactic equivalents. This representation supports dynamic relationship

and event search, information extraction and pattern matching from large document

collections in real time.
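
One way to picture the mapping of linguistic variants onto a common representation is the Python sketch below, in which an active and a passive clause reduce to the same subject-action-object tuple that can then be queried slot by slot. This is a simplified sketch under our own assumptions and not InFact's storage schema: the dictionary format of a parsed clause, the `LEMMAS` table, the `Relation` tuple and the `normalize` and `find` helpers are all hypothetical.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    subject: str
    action: str
    obj: str


# Hypothetical lemma table; a real system would use a morphological analyzer.
LEMMAS = {"attacked": "attack", "attacks": "attack"}


def normalize(clause):
    """Map a parsed clause (dict format assumed here) to a canonical relation.

    For passive clauses the agent becomes the subject and the grammatical
    subject becomes the object, so both voices yield the same tuple.
    """
    verb = LEMMAS.get(clause["verb"], clause["verb"])
    if clause.get("voice") == "passive":
        return Relation(clause["agent"], verb, clause["subject"])
    return Relation(clause["subject"], verb, clause["object"])


def find(relations, subject=None, action=None, obj=None):
    """Return relations matching every slot that is not None."""
    wanted = Relation(subject, action, obj)
    return [
        r for r in relations
        if all(w is None or w == v for w, v in zip(wanted, r))
    ]


# "Rebels attacked the convoy" / "The convoy was attacked by rebels"
active = {"subject": "rebels", "verb": "attacked", "object": "convoy", "voice": "active"}
passive = {"subject": "convoy", "verb": "attacked", "agent": "rebels", "voice": "passive"}
relations = [normalize(active), normalize(passive)]
```

Because both clauses normalize to the same tuple, a query such as `find(relations, action="attack", obj="convoy")` retrieves both sentences regardless of voice.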

5.2.1 Indexing

With reference to Figure 5.1, InFact's Indexing Service performs in order: 1) docu-

ment processing, 2) clause processing, and 3) linguistic normalization.
