Database Reference
In-Depth Information
state drives. Combination strategies try to strike a balance between
low book-keeping and a small number of random accesses. A simple
round robin strategy, called nra (for No Random Access algorithm) is
shown in Algorithm 6.1.2.
The algorithm keeps a candidate set M containing tuples consisting
of a string identifier, a partial similarity score, and a bit vector contain-
ing zeros for all query tokens that have not matched with the particular
string identifier yet, and ones for those that a match has been found.
The candidate set is organized as a hash table on string identifiers for
eciently determining whether a given string has already been encoun-
tered or not. Lemma 6.1 states that for each list we only need to scan
elements within a narrow L 1 -norm interval. We skip directly to the
first element in every list with L 1 -norm
1 (assuming that
we are indexing sequences and that lists are primarily sorted by token
positions, we also skip to the first element with token position equal
to, or within some bounds from, the token position in the query; for
simplicity in what follows we ignore any positional information). The
algorithm proceeds by reading one element per list in every iteration,
from each active list. According to Lemma 6.1, if the next element read
from list L ( λ i ) has L 1 -norm larger than v θ the algorithm flags L ( λ i )
as inactive. Otherwise, there are two cases to consider:
s
1
θ
v
If the string identifier read is already contained in the can-
didate set M , its entry is updated to reflect the new partial
similarity score. In addition, the bit vector is updated to
reflect the fact that the candidate string contains the par-
ticular query token. Also, the algorithm checks whether any
other lists have already become inactive, which implies one
of two things: If the candidate string contains the token cor-
responding to that list, then the bit vector has already been
set to one from a previous iteration and the partial similarity
score has been updated to reflect that fact; or the candidate
string does not contain that token, hence the bit vector can
be set to one without updating the partial similarity score
(essentially adding zero to the similarity score).
Search WWH ::




Custom Search