5.3 Architecture and Deployment
We designed both indexing and search as parallel distributed services. Figure 5.3
shows a typical deployment scenario, with an indexing service on the left and a
search service on the right. A typical node in each of the diagrams would be a dual-processor machine (e.g., a 2.8+ GHz Xeon in a 1U form factor) with 4 GB of RAM and two 120 GB drives.
The Indexing Service (left) processes documents in parallel. Index workers access
source documents from external web servers. Multiple index workers can run on each
node. Each index worker performs all the “Annotation Engine” analyses described in
Figure 5.1. An index manager orchestrates the indexing process across many index
workers. The results of all analyses are stored in temporary indices in the index
workers. At configurable intervals, the index manager orchestrates the merging of
all temporary indices into the partition index components.
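As a rough sketch of this orchestration (the class names, queueing scheme, and merge policy below are illustrative assumptions, not the InFact API), an index manager might fan documents out to workers and periodically fold their temporary indices into a partition index:

```python
from queue import Queue, Empty
import threading

class IndexWorker:
    """Analyzes documents, accumulating results in a temporary index."""
    def __init__(self):
        self.temp_index = {}  # term -> list of document ids

    def index(self, doc_id, text):
        for term in text.lower().split():
            self.temp_index.setdefault(term, []).append(doc_id)

class IndexManager:
    """Farms documents out to workers; merges temporary indices."""
    def __init__(self, num_workers=4):
        self.workers = [IndexWorker() for _ in range(num_workers)]
        self.queue = Queue()

    def submit(self, doc_id, text):
        self.queue.put((doc_id, text))

    def run(self):
        # One thread per worker drains the shared document queue.
        def drain(worker):
            while True:
                try:
                    doc_id, text = self.queue.get_nowait()
                except Empty:
                    return
                worker.index(doc_id, text)
        threads = [threading.Thread(target=drain, args=(w,))
                   for w in self.workers]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    def merge_into(self, partition_index):
        # Called at a configurable interval: fold every temporary
        # index into the partition index, then clear worker state.
        for w in self.workers:
            for term, doc_ids in w.temp_index.items():
                partition_index.setdefault(term, []).extend(doc_ids)
            w.temp_index.clear()
```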
A partition index hosts the actual disk-based indices used for searching. The contents of a document corpus are broken up into one or more subsets, each stored in a partition index. The system supports multiple partition indices; the exact number depends on corpus size, the number of queries per second, and the desired response time. Indices are queried in parallel and are heavily I/O-bound. Partition
indices are attached to the leaf nodes of the Search Service on the right.
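A minimal sketch of this arrangement, assuming a simple hash-based routing rule and an in-memory dictionary as a stand-in for each disk-based index, might look like the following; because the indices are I/O-bound, each partition is searched on its own thread:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4
# In-memory stand-ins for the disk-based indices (term -> doc ids).
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def partition_for(doc_id):
    # Route each document to exactly one partition index.
    return hash(doc_id) % NUM_PARTITIONS

def add_document(doc_id, terms):
    index = partitions[partition_for(doc_id)]
    for term in terms:
        index.setdefault(term, []).append(doc_id)

def search_all(term):
    # Query every partition in parallel and concatenate the hits.
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        per_partition = list(pool.map(lambda p: p.get(term, []),
                                      partitions))
    return [doc for hits in per_partition for doc in hits]
```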
In addition to storing results in a temporary index, index workers can also store
the raw results of parsing in a Database Management System (DBMS). The database
is used almost exclusively to restore a partition index in the event of index corrup-
tion. Data storage requirements on the DBMS range between 0.5x and 6x the corpus size, depending on which InFact recovery options are enabled. Once
a document has been indexed and merged into a partition index it is available for
searching.
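The recovery path can be sketched as below, under the assumption that the DBMS keeps one row of raw parse output per (document, term) pair; the table name and schema here are hypothetical, chosen only to make the replay concrete:

```python
import sqlite3

def restore_partition(db_path):
    """Rebuild a corrupted partition index by replaying the raw
    parse results stored in the DBMS."""
    partition_index = {}
    conn = sqlite3.connect(db_path)
    try:
        # Assumed schema: parse_results(doc_id TEXT, term TEXT)
        rows = conn.execute("SELECT doc_id, term FROM parse_results")
        for doc_id, term in rows:
            partition_index.setdefault(term, []).append(doc_id)
    finally:
        conn.close()
    return partition_index
```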
In a typical search deployment, queries are sent from a client application; the
client application may be a Web browser or a custom application built using the
Search API. Requests arrive over HTTP and are passed through a Web Server to
the Search Service layer and on to the top searcher of a searcher tree. Searchers
are responsible for searching one or more partition indices. Multiple searchers are
supported and can be stacked in a hierarchical tree configuration to enable searching
large data sets. The top-level searcher routes ontology-related requests to one or more ontology searchers, which can run on a single node. Search requests are passed to
child searchers, which then pass the request down to one or more partition indices.
The partition index performs the actual search against the index, and the result
passes up the tree until it arrives back at the client for display to the user.
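The flow down and back up the searcher tree can be sketched as follows; the class names are illustrative rather than the actual Search API, and the partition indices are again in-memory stand-ins:

```python
class PartitionSearcher:
    """Leaf node: runs the actual search against one partition index."""
    def __init__(self, partition_index):
        self.index = partition_index  # term -> list of doc ids

    def search(self, term):
        return list(self.index.get(term, []))

class Searcher:
    """Interior node: passes the request down to its children and
    merges the results on the way back up."""
    def __init__(self, children):
        self.children = children

    def search(self, term):
        hits = []
        for child in self.children:
            hits.extend(child.search(term))
        return hits

# A two-level searcher tree over four partition indices:
parts = [{"treaty": ["doc1"]}, {"treaty": ["doc7"]}, {}, {}]
leaves = [PartitionSearcher(p) for p in parts]
top = Searcher([Searcher(leaves[:2]), Searcher(leaves[2:])])
print(top.search("treaty"))  # ['doc1', 'doc7']
```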
If a particular segment of data located in a partition index is very popular and
becomes a search bottleneck, it may be cloned; the parent searcher will then load-balance across the two or three resulting partition indices. Similarly, if ontology searches become a bottleneck, more ontology searchers may be added, and if a searcher becomes a bottleneck, more searchers can be added. The search service and Web server tiers may be replicated as well if a load balancer is used.
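As a sketch of the cloning strategy, a parent could rotate requests across identical clones; the round-robin policy below is an assumption, since the text only says that the parent searcher load-balances:

```python
import itertools

class CloneBalancer:
    """Parent-side load balancer over cloned partition searchers."""
    def __init__(self, clones):
        # All clones hold identical data, so any clone can answer.
        self._cycle = itertools.cycle(clones)

    def search(self, term):
        # Round-robin: each request goes to the next clone in turn.
        return next(self._cycle).search(term)
```

Reusing PartitionSearcher from the sketch above, a hot partition and its clone could be wrapped as CloneBalancer([...]) and dropped into the searcher tree in place of the original leaf.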
Figure 5.3 depicts a large-scale deployment. In the
GlobalSecurity.org portal, we currently need only four nodes to support a user com-
munity of 100,000 against a corpus of several GB of international news articles,
which are updated on a daily basis.