Searching for Complex Patterns over Large Stored Information Repositories - Advances in Databases

Database Reference

In-Depth Information

Detecting complex patterns over text streams has been studied and shown to be pos-

sible in [5], in which a suite of complex operators and their algorithms were developed

that detect complex patterns over stream data. Detecting patterns over a dynamic text

source (e.g., news feeds, IP packets) essentially entails streaming the data to detect the

required patterns. In other words, to detect a pattern, the entire data source must be read

(or parsed) every time. This is inefficient, but unavoidable, because of the fast-changing

nature of the data source. Also, if freshness of the search results are important to the

user, it becomes necessary to read the data source every time while processing a query.

However, if the data source is relatively static (e.g., Web repositories), it is unnecessary

to read the entire source each time a pattern is to be detected. The inefficiency will

be exacerbated as the data source grows larger. A better approach would be to build

and leverage an index on the data, as is done by search engines, and the information

in the index could be used for answering queries. Since the index would be computed

off-line, this approach may result in an occasional out-of-date search result. However,

considering that the data source is not frequently updated, we can assume this is accept-

able to the user. For such relatively static data sources, the gains in terms of efficiency

of retrieval that leveraging an index will bring outweigh the slight disadvantage of an

occasional out-of-date result.

The techniques developed for searching complex patterns over streams (as in XML

streams, news feed, stock prices) in [5] makes use of the sequential inflow of patterns by

reading the entire data source to detect a pattern. However, if the same patterns need to

be detected in stored data (as in web repositories) then streaming is very inefficient. It is

more efficient to index the repository (or use an already existing index) to detect the pat-

terns. Indexing will lose the sequence of occurrence of patterns in the data. This order

of occurrence of patterns is the key to detecting patterns based on proximity, contain-

ment, sequence, etc. The main contributions of this paper are to: (i) Identify information

that is needed as part of the index to correctly and efficiently detect complex patterns

as compared to the streaming approach, (ii) Explore the extent of the complexity of

the patterns that can be detected using indexed information, and (iii) Develop efficient

algorithms for index-based pattern detection.

The rest of this paper is organized as follows: Section 2 discusses the semantics of

the InfoSearch operators and Section 3 explains the algorithms used by the operators.

Section 4 explains the design of the InfoSearch System. Section 4.1 explains the imple-

mentation aspects of the system. Section 5 shows detailed experimental results. Section

6 reviews the related work, and Section 7 concludes the paper.

2

Pattern Specification and Detection

The InfoSearch framework discussed in this paper consists of an expressive query lan-

guage (introduced in [5]) through which the user can specify patterns and a pattern

detection engine capable of using the index to retrieve documents. InfoSearch detailed

in this paper has been briefly summarized in [6]. InfoSearch adopts the Pattern Spec-

ification Language (PSL) and its associated parser and pattern validator used in In-

foFilter [5]. The focus of this paper is on the detection of complex patterns over large

document repositories.

Advances in Databases

Search WWH ::

Custom Search

Home