Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing

Database Reference

In-Depth Information

the third level of the filter tree we could use a positional filter for every

prefix token. In other words, at the first level we use an interval tree to

partition strings according to norms. Then, for every leaf node of the

interval tree we create a sux tree that indexes only the prefix tokens

of all strings within the L p -norm range of that particular interval tree

leaf node. Finally, at the leaf nodes of the sux trees we create a forest

of B-trees that partitions strings according to the position of the prefix

token corresponding to that particular leaf node. At the last level of

the filter tree we build a forest of inverted indexes containing all the

qualifying strings, as they are filtered out level by level.

For example, a certain path down the tree might correspond to all

strings with L p -norm equal to 100, that contain q -gram 'abc' in their

idf sorted prefix, at absolute position 10 within the string. A query

might have to access several paths down the tree simultaneously (e.g.,

it might have to follow multiple L p -norm intervals). In the end, we

merge the lists corresponding to each candidate path and finally take

the union of resulting strings.

The filter tree can be implemented straightforwardly by construct-

ing one inverted list per token, and then sorting the list in the order

of the filters specified by the filter tree. In the example above, each

inverted list would be sorted, first by the norms of the strings con-

tained in the list, and then by the position of that particular token in

each string. Given a query string, one can identify all candidate strings

by using binary search to locate on every token list the appropriate

norm interval (the norm filter); and subsequently, for every norm the

appropriate positions interval (the positional filter). By examining only

those token lists that correspond to the tokens contained in the idf pre-

fix of the query (the prefix filter), the resulting strings are the strings

that would be visited by the actual filter tree.

The order in which the filters are applied significantly affects the

performance of the filter tree. Ordering depends on query and data

characteristics, as well as on the filtering techniques used. Intuitively,

simpler, easy to evaluate filters should be used as close to the root of the

filter tree as possible. Also, filters that result in non-overlapping parti-

tions of strings should be evaluated first, since these filters reduce the

total amount of candidate paths produced. For example the L p -norm

Search WWH ::

Custom Search

Home