Database Reference
In-Depth Information
converted into a stream, fed to InfoFilter [5] and patterns involving all the operators
were detected over this document stream. The time taken to process the stream was
noted in milliseconds. Subsequently, the same documents were indexed for testing the
efficiency of InfoSearch and the time taken for the result set to reach the root node was
noted in milliseconds. For each of the operators, the performance of these systems are
noted and are shown in Figure 7. The experiments were again repeated with document
sets of sizes 1MB, 2MB, 5MB, 10MB, 20MB and 50MB and the results are shown in
Figure 8. The last row of the table indicates a pattern whose PDG is height 4 (has 4
operators and at least 5 leaf nodes). From that row it is clear that the improvement for
data sizes as little as 1MB is more than a factor of 100. This is even better with the
improvement being more than a factor of 1000 for data volumes of 50MB. As can be
seen, the detection for index-based algorithm grows sub-linearly whereas the detection
time for the streaming approach grows super-linearly. The time taken to index the doc-
uments is not considered in the above comparison (which is negligible as compared to
the improvements in detection time), since it is a pre-processing step, and is amortized
over multiple searches on the repository.
6
Related Work
Most search engines use a variation of the vector space model [1] to select documents
against a query from a document collection. In addition, search engines try to add other
factors to the ranking process for documents including external (meta) information about
the documents, references to documents from other documents, etc. Google [2] stores
the pages fetched by the crawler in compressed form in a repository. It has a document
index, which is a fixed width ISAM index, to keep information about each document.
It also has a lexicon, forward index and an inverted index to facilitate rapid access to
document lists. However, it support queries only in the form of keywords and Boolean
compositions of keywords. INQUERY [3] is based on a form of the probabilistic re-
trieval model called the inference net. Inference nets [4] provide the capability to spec-
ify complex information needs, and compare them to document representations. The
operators supported by INQUERY include and, or, not , a phrase operator and also an
operator that handles proximity between patterns. In addition, specification of a partic-
ular argument as being more important that the others can be done. However, there are
no operators for sequence of patterns, pattern frequency, synonyms and containment.
7
Conclusions
It was observed that current search systems are somewhat restrictive in the expressive-
ness of patterns that can be specified by the user. InfoSearch facilitates searching of
complex patterns involving proximity, frequency, containment and sequences over a
given document collection. The use of pattern operators and its modified semantics to
provide an expressive pattern specification mechanism and to develop algorithms for an
index-based approach are the main contributions of the paper. We have demonstrated
that there is no loss of detection capability from stream mode to index mode for the
pattern specification language. The overhead of additional information is quite small
 
Search WWH ::




Custom Search