identifies the largest L1-norm in the candidate set and scans subsequent lists using that bound. If a string already appears in the candidate set, its similarity is updated. If a string does not appear in the candidate set (it is either a new string or an already pruned string), then the string is inserted in the candidate set if and only if its L1-norm satisfies the recomputed bounds based on v.
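To make the update rule concrete, the following is a minimal Python sketch of the decision taken for each entry scanned from a token list. All names are illustrative assumptions (the text does not prescribe an interface), and the candidate set is shown as a dictionary for brevity rather than the sorted linked list the algorithm actually uses:

    # Hypothetical per-entry update; `candidates` maps string id to
    # (L1-norm, accumulated partial similarity), and `lower`/`upper`
    # are the L1-norm bounds recomputed from the query v.
    def process_entry(candidates, string_id, l1_norm, weight, lower, upper):
        if string_id in candidates:
            # Already a candidate: accumulate its partial similarity.
            norm, sim = candidates[string_id]
            candidates[string_id] = (norm, sim + weight)
        elif lower <= l1_norm <= upper:
            # New or previously pruned string: admit it only if its
            # L1-norm satisfies the recomputed bounds based on v.
            candidates[string_id] = (l1_norm, weight)
        # Otherwise the entry is ignored; the string cannot qualify.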
To reduce the bookkeeping cost of the candidate set, the algorithm stores the set as a linked list, sorted primarily in increasing order of L1-norm and secondarily in increasing order of string identifiers. The algorithm can then merge join any token list with the candidate list very efficiently in one pass. The complete algorithm is shown as Algorithm 6.1.4.
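The one-pass merge join is possible because both lists share the same sort order. The sketch below is an assumed rendering, not a transcription of Algorithm 6.1.4: token list entries are (L1-norm, string id) pairs carrying a common token weight, candidate entries are (L1-norm, string id, partial similarity) triples, and pruning by best-possible score is omitted for brevity:

    def merge_join(token_list, cand_list, weight, lower, upper):
        # Both inputs are sorted by (L1-norm, string id); the output
        # preserves that order, so the next join is again a single pass.
        out, i, j = [], 0, 0
        while i < len(token_list) and j < len(cand_list):
            tn, tid = token_list[i]
            cn, cid, sim = cand_list[j]
            if (tn, tid) == (cn, cid):        # existing candidate: update
                out.append((cn, cid, sim + weight))
                i += 1; j += 1
            elif (tn, tid) < (cn, cid):       # new or pruned string
                if lower <= tn <= upper:      # admit only within bounds
                    out.append((tn, tid, weight))
                i += 1
            else:                             # candidate not in this list
                out.append((cn, cid, sim))
                j += 1
        out.extend(cand_list[j:])             # surviving candidates
        for tn, tid in token_list[i:]:        # remaining postings
            if lower <= tn <= upper:
                out.append((tn, tid, weight))
        return out

Since both inputs are consumed front to back, the cost of each join is linear in their combined length, which is exactly why the sorted representation keeps the bookkeeping cheap.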
The biggest advantage of the heaviest first algorithm is that it can benefit from long sequential accesses, one list at a time. Notice that both the nra and the multiway merge strategies have to access lists in parallel. For traditional disk based storage devices, as the query size grows and the number of lists that need to be accessed in parallel increases, a larger buffer is required to prefetch entries from all the lists, and the seek time spent moving from one list to the next increases (neither is an issue for solid state drives or for main memory processing).
Notice that the heaviest first strategy has another advantage when token weights are assigned according to token popularity, as is the case for idf weights. In that case, the heaviest tokens are rare tokens that correspond to the shortest lists, and the heaviest first strategy is equivalent to scanning the lists in order of their length, shortest list first. The advantage is that this keeps the size of the candidate set to a minimum: as the algorithm advances to lighter tokens, the probability that newly encountered strings (strings that do not share any of the heavy tokens with the query) will satisfy the similarity threshold decreases significantly, so such strings are immediately discarded. In addition, the algorithm terminates with high probability before long lists have to be examined exhaustively.
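A toy example (with assumed document frequencies, purely for illustration) shows why descending idf order coincides with ascending list length:

    import math

    N = 1000                                    # assumed corpus size
    df = {"xylo": 3, "qua": 12, "base": 250, "data": 400}
    idf = {t: math.log(N / n) for t, n in df.items()}

    heaviest_first = sorted(idf, key=idf.get, reverse=True)
    shortest_first = sorted(df, key=df.get)
    assert heaviest_first == shortest_first     # same scan order

Since idf is a decreasing function of list length, the two orderings always agree, and the rare, short lists are exhausted before the long lists are ever touched.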
This observation leads to an alternative strategy that uses a
combination of sequential accesses for short lists and random accesses
for long lists. The alternative algorithm is similar to the heaviest