Database Reference
In-Depth Information
the third level of the filter tree we could use a positional filter for every
prefix token. In other words, at the first level we use an interval tree to
partition strings according to norms. Then, for every leaf node of the
interval tree we create a sux tree that indexes only the prefix tokens
of all strings within the L p -norm range of that particular interval tree
leaf node. Finally, at the leaf nodes of the sux trees we create a forest
of B-trees that partitions strings according to the position of the prefix
token corresponding to that particular leaf node. At the last level of
the filter tree we build a forest of inverted indexes containing all the
qualifying strings, as they are filtered out level by level.
For example, a certain path down the tree might correspond to all
strings with L p -norm equal to 100, that contain q -gram 'abc' in their
idf sorted prefix, at absolute position 10 within the string. A query
might have to access several paths down the tree simultaneously (e.g.,
it might have to follow multiple L p -norm intervals). In the end, we
merge the lists corresponding to each candidate path and finally take
the union of resulting strings.
The filter tree can be implemented straightforwardly by construct-
ing one inverted list per token, and then sorting the list in the order
of the filters specified by the filter tree. In the example above, each
inverted list would be sorted, first by the norms of the strings con-
tained in the list, and then by the position of that particular token in
each string. Given a query string, one can identify all candidate strings
by using binary search to locate on every token list the appropriate
norm interval (the norm filter); and subsequently, for every norm the
appropriate positions interval (the positional filter). By examining only
those token lists that correspond to the tokens contained in the idf pre-
fix of the query (the prefix filter), the resulting strings are the strings
that would be visited by the actual filter tree.
The order in which the filters are applied significantly affects the
performance of the filter tree. Ordering depends on query and data
characteristics, as well as on the filtering techniques used. Intuitively,
simpler, easy to evaluate filters should be used as close to the root of the
filter tree as possible. Also, filters that result in non-overlapping parti-
tions of strings should be evaluated first, since these filters reduce the
total amount of candidate paths produced. For example the L p -norm
Search WWH ::




Custom Search