Searching for Complex Patterns over Large Stored Information Repositories - Advances in Databases

Database Reference

In-Depth Information

and position of sub-patterns is known, can a decision be made whether they can be

combined or not.

The inputs to the operators are sets of tuples containing the document ID, start offset,

and end offset of the corresponding pattern. Each tuple represents a single occurrence

of the corresponding pattern in the document collection. It is assumed that these sets

of tuples are sorted in ascending order of document ID and by start offset within each

document ID. The operators have to process the input sets tuple by tuple. First, they

have to ensure that the tuples to be merged have the same document ID. Second, they

have to determine which tuple is the initiator and which one is the terminator. Tuples

satisfying the criteria of the operator are combined and added to an output set. After the

operator is done processing the input sets, the output set is propagated to its parent.

Due to space limitations, we discuss only the NEAR operator. Please refer to [9] for

other operator and algorithm discussions.

3.1

The NEAR Operator

When the NEAR operator is processing two tuples from the input sets, it has to make a

decision whether the tuples are eligible for combination, and if not, decide which one

to keep and which one to discard. As mentioned earlier, the input tuples may either

be point tuples or interval tuples. To keep the forthcoming discussion generalized, we

assume that the input tuples have both start and end offsets. We now discuss the different

cases possible when we consider two input tuples, and the corresponding actions taken

in each case.

Merging strategy in NEAR: The inputs to the NEAR operator are two sets of tuples

corresponding to the left child and the right child, and an optional distance. Let the left

set be denoted by L and the right set by R . Let the distance be denoted by d . We arbitrar-

ily assign the first tuple from L as initiator , and the first tuple from R as terminator .

Let i s and i e denote the start offset and end offset of the initiator ,and t s and t e denote

the start offset and end offset of the terminator .Let i +1

be the next tuple from the set

which initiator belongs to, and t +1

be the next tuple from the set which terminator

belongs to.

If initiator and terminator do not belong to the same docID, we advance the

pointer which is pointing to a smaller docID. Since the sets are sorted by docID, this

is similar to a sort-merge operation. When initiator and terminator point to tuples

belonging to the same docID, three cases are possible.

Case 1 ( i e <t e ): This means that the assumed initiator ends before the assumed

terminator . The different possibilities are shown in Figure 3. We perform the follow-

ing sequence of actions:

if initiator and terminator overlap then

lookahead 2 to determine new initiator and terminator

go to the beginning of this operation and re-process the new

initiator

and

terminator

2

The Lookahead algorithm is explained below.

Advances in Databases

Search WWH ::

Custom Search

Home