Database Reference
In-Depth Information
Ta b l e 1 . A sample set of documents
Document ID
Document contents
1
information retrieval
2
Specifying complex queries
3
information on information retrieval
Ta b l e 2 . Inverted index on documents in
Table 1
Ta b l e
3 .
Inverted
index
with
position
information
Keyword
Documents
Keyword
Documents with position
information
1,3
information
1
<
1
>
,3
<
1,3
>
retrieval
1,3
retrieval
1 < 2 > ,3 < 4 >
Specifying
2
Specifying
2 < 1 >
complex
2
complex
2
<
2
>
queries
2
queries
2 < 3 >
on
3
queries, we need to compute the distance between two given patterns, and also the rela-
tive order of occurrence of these patterns. For example, a query such as “information”
NEAR/2 “retrieval” cannot be answered using information from such an index, because
the distance between occurrences of “information” and “retrieval” within a given doc-
ument needs to be computed. This distance cannot be computed given just the document
which the patterns belong to. The position of every occurrence of the keyword within
a document must also be provided by the index [8]. Table 3 shows an inverted index
generated on the documents in Table 1 with the position information stored.
Hence, InfoSearch needs at least the document ID and the position of a given key-
word from the index with which it is integrated, in order to detect complex patterns.
One of the main goals of this work was to assess whether this information is sufficient
to enable complex pattern detection over an index, if the same patterns can be detected
by reading the data source in sequence.
2.4
Pattern Detection Graphs
Patterns are detected using a data structure called Pattern Detection Graph (PDG). A
query submitted to InfoSearch is converted into a PDG. Leaf nodes of the PDG corre-
spond to simple patterns such as keywords, phrases or system defined patterns. Internal
nodes correspond to complex patterns and encapsulate the logic of the corresponding
operator. For example, the PDG corresponding to the pattern “Protein” FOLLOWED
BY “clustering” is shown in Figure 2. The input to a leaf node is a set corresponding to
the index lookup for the term or phrase represented by the leaf node. This set consists
of < docID, start offset, end offset > tuples . As shown in the figure, “protein” occurs
once at offset 10 in document 1 and “clustering” occurs once at offset 12 in document
1. As another example, the set of tuples for the keyword “information” from the index
shown in Table 3 is: 1 < 1,1 > ,3 < 1,1 > and 3 < 3,3 > .
 
Search WWH ::




Custom Search