5.3 Architecture and Deployment
We designed both indexing and search as parallel distributed services. Figure 5.3
shows a typical deployment scenario, with an indexing service on the left and a
search service on the right. A typical node in each of the diagrams would be a dual-processor machine (e.g., a 2.8+ GHz Xeon in a 1U form factor) with 4 GB of RAM and two 120 GB drives.
The Indexing Service (left) processes documents in parallel. Index workers access
source documents from external web servers. Multiple index workers can run on each
node. Each index worker performs all the “Annotation Engine” analyses described in
Figure 5.1. An index manager orchestrates the indexing process across many index
workers. The results of all analyses are stored in temporary indices in the index
workers. At configurable intervals, the index manager orchestrates the merging of
all temporary indices into the partition index components.
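As a rough sketch of this orchestration (the class names, queueing scheme, and merge policy below are illustrative assumptions, not the InFact API), an index manager might fan documents out to workers and periodically fold their temporary indices into a partition index:

```python
from queue import Queue, Empty
import threading

class IndexWorker:
    """Analyzes documents, accumulating results in a temporary index."""
    def __init__(self):
        self.temp_index = {}  # term -> list of document ids

    def index(self, doc_id, text):
        for term in text.lower().split():
            self.temp_index.setdefault(term, []).append(doc_id)

class IndexManager:
    """Farms documents out to workers; merges temporary indices."""
    def __init__(self, num_workers=4):
        self.workers = [IndexWorker() for _ in range(num_workers)]
        self.queue = Queue()

    def submit(self, doc_id, text):
        self.queue.put((doc_id, text))

    def run(self):
        # One thread per worker drains the shared document queue.
        def drain(worker):
            while True:
                try:
                    doc_id, text = self.queue.get_nowait()
                except Empty:
                    return
                worker.index(doc_id, text)
        threads = [threading.Thread(target=drain, args=(w,))
                   for w in self.workers]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    def merge_into(self, partition_index):
        # Called at a configurable interval: fold every temporary
        # index into the partition index, then clear worker state.
        for w in self.workers:
            for term, doc_ids in w.temp_index.items():
                partition_index.setdefault(term, []).extend(doc_ids)
            w.temp_index.clear()
```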
A partition index hosts the actual disk-based indices used for searching. The contents of a document corpus are broken up into one or more subsets, each stored in a partition index. The system supports multiple partition indices; the exact number depends on corpus size, the number of queries per second, and the desired response time. Indices are queried in parallel and are heavily I/O-bound. Partition
indices are attached to the leaf nodes of the Search Service on the right.
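A minimal sketch of this arrangement, assuming a simple hash-based routing rule and an in-memory dictionary as a stand-in for each disk-based index, might look like the following; because the indices are I/O-bound, each partition is searched on its own thread:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4
# In-memory stand-ins for the disk-based indices (term -> doc ids).
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def partition_for(doc_id):
    # Route each document to exactly one partition index.
    return hash(doc_id) % NUM_PARTITIONS

def add_document(doc_id, terms):
    index = partitions[partition_for(doc_id)]
    for term in terms:
        index.setdefault(term, []).append(doc_id)

def search_all(term):
    # Query every partition in parallel and concatenate the hits.
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        per_partition = list(pool.map(lambda p: p.get(term, []),
                                      partitions))
    return [doc for hits in per_partition for doc in hits]
```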
In addition to storing results in a temporary index, index workers can also store
the raw results of parsing in a Database Management System (DBMS). The database
is used almost exclusively to restore a partition index in the event of index corrup-
tion. Data storage requirements on the DBMS range between 0.5x and 6x the corpus size, depending on which InFact recovery options are enabled. Once
a document has been indexed and merged into a partition index it is available for
searching.
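The recovery path can be sketched as below, under the assumption that the DBMS keeps one row of raw parse output per (document, term) pair; the table name and schema here are hypothetical, chosen only to make the replay concrete:

```python
import sqlite3

def restore_partition(db_path):
    """Rebuild a corrupted partition index by replaying the raw
    parse results stored in the DBMS."""
    partition_index = {}
    conn = sqlite3.connect(db_path)
    try:
        # Assumed schema: parse_results(doc_id TEXT, term TEXT)
        rows = conn.execute("SELECT doc_id, term FROM parse_results")
        for doc_id, term in rows:
            partition_index.setdefault(term, []).append(doc_id)
    finally:
        conn.close()
    return partition_index
```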
In a typical search deployment, queries are sent from a client application; the
client application may be a Web browser or a custom application built using the
Search API. Requests arrive over HTTP and are passed through a Web Server to
the Search Service layer and on to the top searcher of a searcher tree. Searchers
are responsible for searching one or more partition indices. Multiple searchers are
supported and can be stacked in a hierarchical tree configuration to enable searching
large data sets. The top-level searcher routes ontology-related requests to one or more ontology searchers, which can run on a single node. Search requests are passed to
child searchers, which then pass the request down to one or more partition indices.
The partition index performs the actual search against the index, and the result
passes up the tree until it arrives back at the client for display to the user.
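The flow down and back up the searcher tree can be sketched as follows; the class names are illustrative rather than the actual Search API, and the partition indices are again in-memory stand-ins:

```python
class PartitionSearcher:
    """Leaf node: runs the actual search against one partition index."""
    def __init__(self, partition_index):
        self.index = partition_index  # term -> list of doc ids

    def search(self, term):
        return list(self.index.get(term, []))

class Searcher:
    """Interior node: passes the request down to its children and
    merges the results on the way back up."""
    def __init__(self, children):
        self.children = children

    def search(self, term):
        hits = []
        for child in self.children:
            hits.extend(child.search(term))
        return hits

# A two-level searcher tree over four partition indices:
parts = [{"treaty": ["doc1"]}, {"treaty": ["doc7"]}, {}, {}]
leaves = [PartitionSearcher(p) for p in parts]
top = Searcher([Searcher(leaves[:2]), Searcher(leaves[2:])])
print(top.search("treaty"))  # ['doc1', 'doc7']
```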
If a particular segment of data located in a partition index is very popular and
becomes a search bottleneck, it may be cloned; the parent searcher will then load-balance across the two or three resulting partition indices. Similarly, if ontology searches become a bottleneck, more ontology searchers may be added, and if a searcher becomes a bottleneck, more searchers can be added. The search service and Web server tiers may be replicated as well if a load balancer is used.
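As a sketch of the cloning strategy, a parent could rotate requests across identical clones; the round-robin policy below is an assumption, since the text only says that the parent searcher load-balances:

```python
import itertools

class CloneBalancer:
    """Parent-side load balancer over cloned partition searchers."""
    def __init__(self, clones):
        # All clones hold identical data, so any clone can answer.
        self._cycle = itertools.cycle(clones)

    def search(self, term):
        # Round-robin: each request goes to the next clone in turn.
        return next(self._cycle).search(term)
```

Reusing PartitionSearcher from the sketch above, a hot partition and its clone could be wrapped as CloneBalancer([...]) and dropped into the searcher tree in place of the original leaf.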
Figure 5.3 depicts a large-scale deployment. In the
GlobalSecurity.org portal, we currently need only four nodes to support a user com-
munity of 100,000 against a corpus of several GB of international news articles,
which are updated on a daily basis.