Database Reference
In-Depth Information
Detecting complex patterns over text streams has been studied and shown to be pos-
sible in [5], in which a suite of complex operators and their algorithms were developed
that detect complex patterns over stream data. Detecting patterns over a dynamic text
source (e.g., news feeds, IP packets) essentially entails streaming the data to detect the
required patterns. In other words, to detect a pattern, the entire data source must be read
(or parsed) every time. This is inefficient, but unavoidable, because of the fast-changing
nature of the data source. Also, if freshness of the search results are important to the
user, it becomes necessary to read the data source every time while processing a query.
However, if the data source is relatively static (e.g., Web repositories), it is unnecessary
to read the entire source each time a pattern is to be detected. The inefficiency will
be exacerbated as the data source grows larger. A better approach would be to build
and leverage an index on the data, as is done by search engines, and the information
in the index could be used for answering queries. Since the index would be computed
off-line, this approach may result in an occasional out-of-date search result. However,
considering that the data source is not frequently updated, we can assume this is accept-
able to the user. For such relatively static data sources, the gains in terms of efficiency
of retrieval that leveraging an index will bring outweigh the slight disadvantage of an
occasional out-of-date result.
The techniques developed for searching complex patterns over streams (as in XML
streams, news feed, stock prices) in [5] makes use of the sequential inflow of patterns by
reading the entire data source to detect a pattern. However, if the same patterns need to
be detected in stored data (as in web repositories) then streaming is very inefficient. It is
more efficient to index the repository (or use an already existing index) to detect the pat-
terns. Indexing will lose the sequence of occurrence of patterns in the data. This order
of occurrence of patterns is the key to detecting patterns based on proximity, contain-
ment, sequence, etc. The main contributions of this paper are to: (i) Identify information
that is needed as part of the index to correctly and efficiently detect complex patterns
as compared to the streaming approach, (ii) Explore the extent of the complexity of
the patterns that can be detected using indexed information, and (iii) Develop efficient
algorithms for index-based pattern detection.
The rest of this paper is organized as follows: Section 2 discusses the semantics of
the InfoSearch operators and Section 3 explains the algorithms used by the operators.
Section 4 explains the design of the InfoSearch System. Section 4.1 explains the imple-
mentation aspects of the system. Section 5 shows detailed experimental results. Section
6 reviews the related work, and Section 7 concludes the paper.
2
Pattern Specification and Detection
The InfoSearch framework discussed in this paper consists of an expressive query lan-
guage (introduced in [5]) through which the user can specify patterns and a pattern
detection engine capable of using the index to retrieve documents. InfoSearch detailed
in this paper has been briefly summarized in [6]. InfoSearch adopts the Pattern Spec-
ification Language (PSL) and its associated parser and pattern validator used in In-
foFilter [5]. The focus of this paper is on the detection of complex patterns over large
document repositories.
 
Search WWH ::




Custom Search