For that purpose, we sort token lists in increasing (or decreasing) order of the $L_1$-norms of strings, rather than in string identifier order. (Without loss of generality, in the rest we assume that the inverted lists are always sorted in increasing order of $L_p$-norms.) Using Lemma 6.1, we can directly skip over all candidate strings with $L_1$-norm $\|s\|_1 < \theta \|v\|_1$ (where $\|v\|_1$ is the $L_1$-norm of the query) and stop scanning a list whenever we encounter the first candidate string with $L_1$-norm $\|s\|_1 > \|v\|_1 / \theta$.
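To illustrate, a minimal sketch of this norm-based pruning appears below; the list layout (pairs of string identifier and precomputed $L_1$-norm, sorted by norm) and the function name are assumptions made for the example.

```python
from bisect import bisect_left

def scan_norm_sorted_list(token_list, query_norm, theta):
    """Scan one inverted list sorted by increasing L1-norm.

    token_list : list of (string_id, l1_norm) pairs, sorted by l1_norm
                 (hypothetical layout; real lists also carry token weights).
    query_norm : L1-norm of the query, ||v||_1.
    theta      : similarity threshold in (0, 1].
    """
    lower = theta * query_norm    # Lemma 6.1: candidates below this norm can be skipped
    upper = query_norm / theta    # ... and scanning can stop past this norm

    # Jump directly to the first entry with l1_norm >= lower.
    norms = [norm for _, norm in token_list]
    start = bisect_left(norms, lower)

    for string_id, norm in token_list[start:]:
        if norm > upper:          # lists are norm-sorted, so all later entries are pruned too
            break
        yield string_id, norm
```

With norm-sorted lists, both the initial skip and the early stop touch only entries whose norms fall inside the admissible range.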
Given that the lists are not sorted in increasing string identifier order, the obvious choice is to use the classic threshold-based algorithms to compute the similarity of each string. Threshold algorithms utilize special terminating conditions that enable processing to stop before exhaustively scanning all token lists. This is easy to show if the similarity function is a monotone aggregate function. Let $\alpha_i(s, v) = W(\lambda_i, s) / \max(\|s\|_1, \|v\|_1)$ be the partial similarity of data string $s$ and query token $\lambda_i$. Then, $N(s, v) = \sum_{\lambda_i \in s \cap v} \alpha_i(s, v)$. It can be shown that $N(s, v)$ is a monotone aggregate function, i.e., $N(s, v) \le N(s', v)$ if $\forall i \in [1, m]: \alpha_i(s, v) \le \alpha_i(s', v)$, where $\lambda_1, \ldots, \lambda_m$ are the query tokens. The proof is straightforward.
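As a concrete reading of these definitions, the sketch below sums the partial similarities over the tokens shared by $s$ and $v$; the weight function signature `weight(token, s)` and the representation of strings as token sets are assumptions made for illustration.

```python
def partial_similarity(token, s_tokens, s_norm, v_norm, weight):
    """alpha_i(s, v) = W(lambda_i, s) / max(||s||_1, ||v||_1)."""
    return weight(token, s_tokens) / max(s_norm, v_norm)

def normalized_similarity(s_tokens, v_tokens, s_norm, v_norm, weight):
    """N(s, v) = sum of alpha_i(s, v) over tokens lambda_i in s ∩ v."""
    return sum(
        partial_similarity(tok, s_tokens, s_norm, v_norm, weight)
        for tok in s_tokens & v_tokens
    )
```

Since the total is a plain sum of non-negative partial similarities, increasing any single $\alpha_i$ can only increase $N(s, v)$, which is exactly the monotonicity that threshold algorithms exploit.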
There are three threshold algorithm flavors. The first scans lists sequentially and in a round-robin fashion and computes the similarity of strings incrementally. The second uses random accesses to compute similarity aggressively every time a new string identifier is encountered in one of the lists. The third uses combination strategies of both sequential and random accesses.
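The following is a rough sketch of the first, sequential round-robin flavor; the list format and function name are assumptions, and the per-list upper bounds that a complete threshold algorithm uses to terminate early are omitted for brevity.

```python
from collections import defaultdict

def round_robin_accumulate(lists, theta):
    """Round-robin scan over norm-sorted token lists (simplified sketch).

    lists : one list per query token, each containing (string_id, alpha_i) pairs,
            i.e., the partial similarity contributed by that token.
    theta : similarity threshold.

    Partially computed similarities are kept in a hash table keyed by string id;
    a full implementation would also track upper bounds on the unseen entries in
    order to stop before the lists are exhausted.
    """
    accumulated = defaultdict(float)          # book-keeping: partial similarities
    cursors = [iter(lst) for lst in lists]
    active = set(range(len(lists)))

    while active:
        for i in list(active):                # one entry from each active list per round
            entry = next(cursors[i], None)
            if entry is None:
                active.discard(i)             # this list is exhausted
                continue
            string_id, alpha = entry
            accumulated[string_id] += alpha   # incremental similarity computation

    return [sid for sid, sim in accumulated.items() if sim >= theta]
```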
Since the sequential round-robin algorithm computes similarity incrementally, it has to temporarily store in main memory information about strings whose similarity has only been partially computed. Hence, the algorithm has a high book-keeping cost, given that the candidate set needs to be maintained as a hash table organized by string identifiers. The aggressive random-access-based algorithm computes the similarity of strings in one step and hence does not need to store any information in main memory. It therefore has a very small book-keeping cost, but on the other hand it has to perform a large number of random accesses to achieve this. This could be a drawback on traditional storage devices (like hard drives) but unimportant on modern solid-state drives.
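By contrast, a sketch of the aggressive random-access flavor might look as follows; the per-list lookup tables that stand in for random accesses, and the absence of a terminating condition, are simplifications made for the example.

```python
def random_access_flavor(lists, theta):
    """Aggressive random-access flavor (simplified sketch).

    lists : one dict per query token mapping string_id -> alpha_i
            (a stand-in for random accesses into the inverted lists).
    theta : similarity threshold.

    Whenever a string id is encountered for the first time, its similarity is
    computed in one step by probing every other list, so no partially computed
    similarities have to be kept in main memory.
    """
    results = []
    seen = set()                              # only avoids re-scoring a string twice
    for current in lists:
        for string_id in current:
            if string_id in seen:
                continue
            seen.add(string_id)
            # One random access per list to assemble the complete similarity.
            similarity = sum(lst.get(string_id, 0.0) for lst in lists)
            if similarity >= theta:
                results.append(string_id)
    return results
```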