identifies the largest L1-norm in the candidate set and scans subsequent lists using that bound. If a string already appears in the candidate set, its similarity is updated. If a string does not appear in the candidate set (it is either a new string or an already pruned string), then the string is inserted in the candidate set if and only if its L1-norm satisfies the recomputed bounds based on v.
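To make the update rule concrete, the following is a minimal Python sketch of the decision taken for each entry scanned from a token list. All names are illustrative assumptions (the text does not prescribe an interface), and the candidate set is shown as a dictionary for brevity rather than the sorted linked list the algorithm actually uses:

    # Hypothetical per-entry update; `candidates` maps string id to
    # (L1-norm, accumulated partial similarity), and `lower`/`upper`
    # are the L1-norm bounds recomputed from the query v.
    def process_entry(candidates, string_id, l1_norm, weight, lower, upper):
        if string_id in candidates:
            # Already a candidate: accumulate its partial similarity.
            norm, sim = candidates[string_id]
            candidates[string_id] = (norm, sim + weight)
        elif lower <= l1_norm <= upper:
            # New or previously pruned string: admit it only if its
            # L1-norm satisfies the recomputed bounds based on v.
            candidates[string_id] = (l1_norm, weight)
        # Otherwise the entry is ignored; the string cannot qualify.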
To reduce the bookkeeping cost of the candidate set, the algorithm stores the set as a linked list, sorted primarily in increasing order of L1-norm and secondarily in increasing order of string identifiers. The algorithm can then merge join any token list with the candidate list very efficiently in one pass. The complete algorithm is shown as Algorithm 6.1.4.
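The one-pass merge join is possible because both lists share the same sort order. The sketch below is an assumed rendering, not a transcription of Algorithm 6.1.4: token list entries are (L1-norm, string id) pairs carrying a common token weight, candidate entries are (L1-norm, string id, partial similarity) triples, and pruning by best-possible score is omitted for brevity:

    def merge_join(token_list, cand_list, weight, lower, upper):
        # Both inputs are sorted by (L1-norm, string id); the output
        # preserves that order, so the next join is again a single pass.
        out, i, j = [], 0, 0
        while i < len(token_list) and j < len(cand_list):
            tn, tid = token_list[i]
            cn, cid, sim = cand_list[j]
            if (tn, tid) == (cn, cid):        # existing candidate: update
                out.append((cn, cid, sim + weight))
                i += 1; j += 1
            elif (tn, tid) < (cn, cid):       # new or pruned string
                if lower <= tn <= upper:      # admit only within bounds
                    out.append((tn, tid, weight))
                i += 1
            else:                             # candidate not in this list
                out.append((cn, cid, sim))
                j += 1
        out.extend(cand_list[j:])             # surviving candidates
        for tn, tid in token_list[i:]:        # remaining postings
            if lower <= tn <= upper:
                out.append((tn, tid, weight))
        return out

Since both inputs are consumed front to back, the cost of each join is linear in their combined length, which is exactly why the sorted representation keeps the bookkeeping cheap.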
The biggest advantage of the heaviest first algorithm is that it can benefit from long sequential accesses, one list at a time. Notice that both the nra and the multiway merge strategies have to access lists in parallel. For traditional disk based storage devices, as the query size grows and the number of lists that need to be accessed in parallel increases, a larger buffer is required to prefetch entries from all the lists, and the seek time spent moving from one list to the next increases (neither is an issue for solid state drives or for main memory processing).
Notice that the heaviest first strategy has another advantage when token weights are assigned according to token popularity, as is the case for idf weights. In that case, the heaviest tokens are rare tokens that correspond to the shortest lists, and the heaviest first strategy is equivalent to scanning the lists in order of their length, shortest list first. The advantage is that this keeps the size of the candidate set to a minimum: as the algorithm advances to lighter tokens, the probability that newly encountered strings (strings that do not share any of the heavy tokens with the query) will satisfy the similarity threshold decreases significantly, so such strings are immediately discarded. In addition, the algorithm terminates with high probability before long lists have to be examined exhaustively.
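A toy example (with assumed document frequencies, purely for illustration) shows why descending idf order coincides with ascending list length:

    import math

    N = 1000                                    # assumed corpus size
    df = {"xylo": 3, "qua": 12, "base": 250, "data": 400}
    idf = {t: math.log(N / n) for t, n in df.items()}

    heaviest_first = sorted(idf, key=idf.get, reverse=True)
    shortest_first = sorted(df, key=df.get)
    assert heaviest_first == shortest_first     # same scan order

Since idf is a decreasing function of list length, the two orderings always agree, and the rare, short lists are exhausted before the long lists are ever touched.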
This observation leads to an alternative strategy that uses a
combination of sequential accesses for short lists and random accesses
for long lists. The alternative algorithm is similar to the heaviest