a single, common PDG or sub-PDG for a common expression or sub-expression. While
generating the graph, the graph generator stores the keywords specified in the query in
a keyword buffer. Once the PDG is generated, the graph generator queries the index for
each of the keywords it has stored in its buffer. This is done through the index interface
module, which is responsible for retrieving the “hits” for each keyword from the index.
The detection engine of InfoSearch is designed to be generic and capable of working
with any kind of index. The “hits” are then wrapped into a set of <docID, start offset,
end offset> tuples and passed on to the leaf node that represents the keyword.
Leaf nodes propagate their input to their parent nodes. The parent nodes, which cor-
respond to one of the operators, merge their input sets according to the appropriate
semantics.
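To make the tuple format concrete, the following is a minimal sketch of how raw hits might be wrapped before being handed to a leaf node. The names Tuple and HitWrapper, the raw hit layout {docID, start offset}, and the use of inclusive character offsets are all assumptions made for illustration; the text only fixes the three tuple fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tuple type: one occurrence of a keyword in one document.
final class Tuple {
    final int docId;
    final int startOffset;
    final int endOffset;

    Tuple(int docId, int startOffset, int endOffset) {
        this.docId = docId;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    @Override
    public String toString() {
        return "<" + docId + ", " + startOffset + ", " + endOffset + ">";
    }
}

// Hypothetical helper that turns raw index hits into tuples.
final class HitWrapper {
    // Assumption: each raw hit is {docID, start offset}, and the end offset
    // is the inclusive character position of the keyword's last letter.
    static List<Tuple> wrap(String keyword, int[][] rawHits) {
        List<Tuple> tuples = new ArrayList<>();
        for (int[] hit : rawHits) {
            int start = hit[1];
            tuples.add(new Tuple(hit[0], start, start + keyword.length() - 1));
        }
        return tuples;
    }
}
```

Under these assumptions, a hit for "index" at offset 10 of document 7 becomes the tuple <7, 10, 14>.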
4.1 Implementation
Whenever the graph generator comes across a token which is a keyword or a phrase,
it stores this token in a Vector object called the keyword buffer. The keywords in the
buffer are passed to the index interface after the PDG construction is complete, whereas
the phrases are passed to the phrase processor. The reason for having a keyword buffer
is that it is essential that the PDG is completely constructed before the index can be
queried for the keywords. If the graph generator passed each keyword or phrase to the
index interface or phrase processor as soon as it popped it off the stack, results from
the index could reach the PDG before it is completely constructed. The keyword buffer
therefore prevents the index interface from triggering PDG nodes while the PDG is
still being constructed. If the synonyms option is chosen for
any keyword in the query, the graph generator queries the WordNet synonym database
to get synonyms for the keyword. This is done through an API called the Java WordNet
Library (JWNL) [10]. For each synonym, a leaf node is constructed, and finally a
SYN operator node is constructed which subscribes to the original keyword and all its
synonyms.
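The deferral described above can be illustrated with a short sketch. The class GraphGeneratorSketch and its methods are hypothetical names, and the index lookup is reduced to recording which keywords would be sent to the index interface; only the ordering (buffer during construction, query afterwards) is taken from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the graph generator's buffering behaviour.
final class GraphGeneratorSketch {
    private final List<String> keywordBuffer = new ArrayList<>();
    private final List<String> queriesIssued = new ArrayList<>();

    // Called for each keyword token met while the PDG is being built:
    // the keyword is only buffered, so no node can be triggered yet.
    void onKeyword(String keyword) {
        keywordBuffer.add(keyword);
    }

    // Called once the PDG is complete: only now are the buffered keywords
    // handed to the (here simulated) index interface, in arrival order.
    void finishConstruction() {
        for (String keyword : keywordBuffer) {
            queriesIssued.add(keyword);
        }
        keywordBuffer.clear();
    }

    // Keywords that have been sent to the index so far.
    List<String> queriesIssued() {
        return queriesIssued;
    }
}
```

In this sketch no query is issued until finishConstruction() runs, which mirrors the requirement that leaf nodes receive their input only after every parent node exists.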
The index interface has to provide standard methods to access data from the inte-
grated index, and deliver the results to the pattern detector in a specific format. As
such, it does not matter if the index being integrated is an inverted index, or any other
kind of index, say a B-tree index, as long as an index interface for it is developed. In
other words, if a new index has to be integrated with InfoSearch, an index interface for
that index has to be created which will support the required calls from InfoSearch, and
return data to it in the expected format.
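One way to picture this contract is as a small Java interface with one implementation per index type. The interface name IndexInterface, the method lookup, and the in-memory map below are assumptions for illustration and are not the calls InfoSearch actually defines; the sketch only shows that the caller depends on the contract and not on the index structure behind it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract that any integrated index would have to satisfy.
interface IndexInterface {
    // Returns the hits for a keyword as {docID, start offset, end offset}
    // triples, or an empty list if the keyword is not indexed.
    List<int[]> lookup(String keyword);
}

// Toy inverted index kept in memory; a B-tree-backed index could implement
// the same interface without any change to the caller.
final class InMemoryInvertedIndex implements IndexInterface {
    private final Map<String, List<int[]>> postings = new HashMap<>();

    // Records one occurrence of a keyword.
    void add(String keyword, int docId, int start, int end) {
        postings.computeIfAbsent(keyword, k -> new ArrayList<>())
                .add(new int[] {docId, start, end});
    }

    @Override
    public List<int[]> lookup(String keyword) {
        return postings.getOrDefault(keyword, new ArrayList<>());
    }
}
```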
The pattern detection engine is responsible for processing the result sets from the
index. The index interface passes a reference to a Vector of Tuples corresponding to
a keyword to the leaf node corresponding to the keyword. Each internal node of the PDG
corresponds to one of the operators. It gets references to one or more Vectors from
its children and merges them to produce an output Vector. This merging is done as per
the operator semantics described earlier.
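As a rough illustration of an internal node merging its children's Vectors, the sketch below combines two inputs for a hypothetical OR operator. The class name OrNodeSketch, the use of int[] triples in place of Tuple objects, and the chosen semantics (keep every tuple from both children, ordered by docID and then start offset) are assumptions; the operator semantics the system actually uses are the ones defined earlier in the paper.

```java
import java.util.Vector;

// Hypothetical internal node for an OR operator.
final class OrNodeSketch {
    // Each int[] stands in for a <docID, start offset, end offset> tuple.
    // Assumed semantics: keep all tuples from both children and order the
    // output by docID, then by start offset.
    static Vector<int[]> merge(Vector<int[]> left, Vector<int[]> right) {
        Vector<int[]> out = new Vector<>(left);
        out.addAll(right);
        out.sort((a, b) -> a[0] != b[0]
                ? Integer.compare(a[0], b[0])
                : Integer.compare(a[1], b[1]));
        return out;
    }
}
```

The inputs are left unmodified, so a child's Vector can be shared by several parent nodes, as would happen when a common sub-expression is represented by a single sub-PDG.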
For the first release of this system, we built a simple inverted index using Berkeley
DB Java Edition [11], and integrated it with InfoSearch. Since the Berkeley DB API
is in Java, it was convenient to develop an index interface for it, because the rest of
the InfoSearch system was also developed in Java. To create the inverted index, we
 