Database Reference
In-Depth Information
Searching for Complex Patterns over Large Stored
Information Repositories
Nikhil Deshpande 1 , Sharma Chakravarthy 1 , and Raman Adaikkalavan 2
1 CSE Department, The University of Texas at Arlington
2 CIS Department, Indiana University South Bend
{ sharma,raman } @cs.iusb.edu
Abstract. Although Information Retrieval (IR) systems, including search en-
gines, have been effective in locating documents that contain specified patterns
from large repositories, they support only keyword searches and queries/patterns
that use Boolean operators. Expressive search for complex text patterns is im-
portant in many domains such as patent search, search on incoming news, and
web repositories. In this paper, we first present the operators and their semantics
for specifying an expressive search. We then investigate the detection of complex
patterns - currently not supported by search engines - using a pre-computed in-
dex, and the type of information needed as part of the index to efficiently detect
such complex patterns. We use an expressive pattern specification language and a
pattern detection graph mechanism that allows sharing of common sub-patterns.
Algorithms have been developed for all the pattern operators using the index to
detect complex patterns efficiently. Experiments have been performed to illus-
trate the scalability of the proposed approach, and its efficiency as compared to a
streaming approach.
Keywords: Information retrieval, Complex patterns, Document search.
1
Introduction
Although current IR systems [1,2,3,4] are convenient for doing keyword searches, in
domains such as federal intelligence, fugitive tracking and searching full-text patent in-
formation, there is a need to detect (or search 1 ) more complex patterns in data sources.
Users in these domains may have more precise requirements in terms of what they are
searching for. They may be searching for patterns that involve term frequency (e.g.,
at least 5 occurrences of the phrase “protein clustering”), proximity with sub-patterns
(e.g., “peptide” near “saccharide”, in any order, within 5 words of each other), sequence
of sub-patterns (e.g., “DNA” followed by “modification”) and so on. Further, the pat-
terns that need to be detected may be arbitrarily complex; that is, they may need to be
specified in terms of other patterns (e.g., (“militant” followed by “bomb”) near “Iraq”,
separated by 5 positions or less). The expressiveness of search/query specification pro-
vided by current IR systems, although satisfactory for general searches, is not adequate
for the above application domains.
This work was supported, in part, by the following NSF grants: IIS-0326505, EIA 0216500,
and IIS 0534611.
1
We use the terms “search” and “detect” interchangeably in this paper.
 
Search WWH ::




Custom Search