Searching for Complex Patterns over Large Stored Information Repositories - Advances in Databases

a single, common PDG or sub-PDG for a common expression or sub-expression. While

generating the graph, the graph generator stores the keywords specified in the query in

a keyword buffer. Once the PDG is generated, the graph generator queries the index for

each of the keywords it has stored in its buffer. This is done through the index interface

module, which is responsible for retrieving the “hits” for each keyword from the index.

The detection engine of InfoSearch is designed to be generic and capable of working

with any kind of index. The “hits” are then wrapped into a set of <docID, start offset,

end offset> tuples and passed on to the leaf node that represents the keyword.

Leaf nodes propagate their input to their parent nodes. The parent nodes, each of which

corresponds to one of the operators, merge their input sets according to the appropriate

semantics.

4.1 Implementation

Whenever the graph generator comes across a token which is a keyword or a phrase,

it stores this token in a Vector object called the keyword buffer. The keywords in the

buffer are passed to the index interface after the PDG construction is complete, whereas

the phrases are passed to the phrase processor. The reason for having a keyword buffer

is that it is essential that the PDG is completely constructed before the index can be

queried for the keywords. If the graph generator passed the keywords to the index

interface or phrase processor as it popped them off the stack, results from the index

could be returned to the PDG before it is completely constructed.

Thus, the keyword buffer prevents the index interface from triggering PDG nodes

while the PDG is still being constructed. If the synonyms option is chosen for

any keyword in the query, the graph generator queries the WordNet synonym database

to get synonyms for the keyword. This is done through an API called the Java WordNet

Library (JWNL) [10]. For each synonym, a leaf node is constructed, and finally a

SYN operator node is constructed which subscribes to the original keyword and all its

synonyms.
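
The buffering step described above can be sketched roughly as follows. This is a minimal illustration, not the InfoSearch source: the class name DeferredGraphGenerator, the methods onToken and finishGraph, and the rule that a token containing a space is a phrase are all assumptions made for the example. A List stands in for the Vector mentioned in the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names come from InfoSearch itself.
class DeferredGraphGenerator {
    // Holds keywords and phrases until the PDG is fully built.
    private final List<String> keywordBuffer = new ArrayList<>();

    // Called for every keyword or phrase token met during PDG construction.
    // Nothing is sent to the index here, so no PDG node can be triggered early.
    void onToken(String token) {
        keywordBuffer.add(token);
    }

    // Called once the PDG is complete: only now are the buffered tokens dispatched.
    // Assumption for this sketch: a token containing a space is a phrase.
    void finishGraph(Consumer<String> indexInterface, Consumer<String> phraseProcessor) {
        for (String token : keywordBuffer) {
            if (token.contains(" ")) {
                phraseProcessor.accept(token);
            } else {
                indexInterface.accept(token);
            }
        }
        keywordBuffer.clear();
    }
}
```

The point of the design is ordering: lookups start only after finishGraph is called, so every leaf node already has its parents in place when results arrive.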

The index interface has to provide standard methods to access data from the integrated

index, and deliver the results to the pattern detector in a specific format. As

such, it does not matter if the index being integrated is an inverted index, or any other

kind of index, say a B-tree index, as long as an index interface for it is developed. In

other words, if a new index has to be integrated with InfoSearch, an index interface for

that index has to be created which will support the required calls from InfoSearch, and

return data to it in the expected format.
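
One way to picture such a contract is a small Java interface that any index can implement. The sketch below is illustrative only: the names IndexInterface, lookup, IndexHit, and InMemoryInvertedIndex are assumptions, not the actual InfoSearch API, and the in-memory map merely stands in for a real index such as an inverted index or a B-tree.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One hit in the <docID, start offset, end offset> form described in the text.
final class IndexHit {
    final int docId;
    final int startOffset;
    final int endOffset;

    IndexHit(int docId, int startOffset, int endOffset) {
        this.docId = docId;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

// Hypothetical contract between the pattern detector and any underlying index.
interface IndexInterface {
    List<IndexHit> lookup(String keyword);
}

// Toy adapter: an in-memory inverted index standing in for a real one.
final class InMemoryInvertedIndex implements IndexInterface {
    private final Map<String, List<IndexHit>> postings = new HashMap<>();

    void add(String keyword, int docId, int start, int end) {
        postings.computeIfAbsent(keyword, k -> new ArrayList<>())
                .add(new IndexHit(docId, start, end));
    }

    @Override
    public List<IndexHit> lookup(String keyword) {
        // Unknown keywords yield an empty result rather than null.
        return postings.getOrDefault(keyword, Collections.emptyList());
    }
}
```

Under this assumption, integrating a different index means writing another class that implements the same interface; the detector code that calls lookup does not change.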

The pattern detection engine is responsible for processing the result sets from the

index. The index interface passes a reference to a Vector of Tuples corresponding to

a keyword to the leaf node corresponding to the keyword. Internal nodes of the PDG

correspond to one of the operators. They get references to one or more Vectors from

their children and merge them to produce an output Vector. This merging is done as per

the operator semantics described earlier.
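
As a rough illustration of how an internal node might collect and merge its children's results, consider the sketch below. The operator semantics are defined elsewhere in the paper, so the merge shown here (a plain union ordered by document and start offset) is an assumption chosen for simplicity, as are the names UnionNode, receive, and output.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A <docID, start offset, end offset> tuple as described in the text.
final class Tuple {
    final int docId;
    final int start;
    final int end;

    Tuple(int docId, int start, int end) {
        this.docId = docId;
        this.start = start;
        this.end = end;
    }
}

// Hypothetical internal PDG node with union semantics (an assumption of this sketch).
final class UnionNode {
    private final int childCount;
    private final List<List<Tuple>> inputs = new ArrayList<>();
    private List<Tuple> output; // stays null until every child has reported

    UnionNode(int childCount) {
        this.childCount = childCount;
    }

    // Called by a child (leaf or internal node) when its tuples are ready.
    void receive(List<Tuple> fromChild) {
        inputs.add(fromChild);
        if (inputs.size() == childCount) {
            output = merge();
        }
    }

    // Concatenates all child inputs and orders them by document, then start offset.
    private List<Tuple> merge() {
        List<Tuple> merged = new ArrayList<>();
        for (List<Tuple> in : inputs) {
            merged.addAll(in);
        }
        merged.sort(Comparator.comparingInt((Tuple t) -> t.docId)
                              .thenComparingInt(t -> t.start));
        return merged;
    }

    List<Tuple> output() {
        return output;
    }
}
```

A node for another operator would keep the same receive-then-merge shape and replace only the body of merge with that operator's semantics.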

For the first release of this system, we built a simple inverted index using Berkeley

DB Java Edition [11], and integrated it with InfoSearch. Since the Berkeley DB API

is in Java, it was convenient to develop an index interface for it, because the rest of

the InfoSearch system was also developed in Java. To create the inverted index, we
