have diverged far enough from the true values such that the guarantees
no longer hold, the pending updates are applied in batch.
The propagation algorithm essentially computes lower and upper
bounds between which the weight of individual tokens can vary, such
that a selection query with a well-defined reduced threshold θ′ will
return all true answers (i.e., will not result in any false negatives). The
additional cost is an increased number of false positives. The reduced
threshold θ′ is a function of θ, the particular weighting scheme, and
the relaxation bounds. The tighter the chosen relaxation bounds, the
smaller the number of false positives.
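As a concrete illustration, the interplay between the relaxation bounds and the reduced threshold can be sketched as follows. The function name, the single aggregate `slack` bound, and the additive error model are illustrative assumptions for this sketch, not the survey's actual propagation algorithm:

```python
# Hypothetical sketch: selection with a reduced threshold so that stale
# token weights cannot cause false negatives. Assumption: the per-token
# relaxation bounds translate into a worst-case absolute error `slack`
# on the approximate score of any candidate string.

def select_with_relaxation(approx_scores, theta, slack):
    """Return every string whose approximate score clears the reduced
    threshold theta' = theta - slack.

    If each true score differs from its approximate score by at most
    `slack`, then any true answer (true score >= theta) has an
    approximate score >= theta - slack and is never pruned. Strings in
    the result whose true score is below theta are the false positives
    that a verification step must later remove.
    """
    theta_prime = theta - slack
    return {s for s, score in approx_scores.items() if score >= theta_prime}
```

Note how the tighter the relaxation bounds (the smaller `slack`), the closer θ′ is to θ and the fewer false positives survive the filter.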
6.8 Discussion and Related Issues
It is very difficult to determine whether any of the aforementioned
strategies performs better than the rest across the board. The relative
cost savings between sorting according to string identifiers, sorting
according to L_p-norms, or prioritizing lists depend heavily on a variety
of factors, the most important of which are: the algorithm used, the
weighting scheme used, the specific query (whether it contains tokens
with very long lists, for which L_p-norm filtering will yield significant
pruning), the overall token distribution of the data strings, the storage
medium used to store the lists (e.g., disk drive, main memory, solid-state
drive), and the type of compression used for the lists, if any.
Choosing the right algorithm depends on specific application and data
characteristics.
It is important to note here that most list compression algorithms
are designed for compressing string identifiers, which are natural
numbers. The idea is that natural numbers in sorted order can easily
be represented by delta coding: only the first number is stored, and
each consecutive number is represented by its distance from the
previous number. This representation decreases the magnitude of the
stored numbers (since deltas are expected to be small when the numbers
are in sorted order) and enables very efficient compression. A significant
problem with all normalized similarity functions is that, given arbitrary
token weights, token lists have to store real-valued L_1-norms (or
L_2-norms for Cosine similarity). Lists containing
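The delta coding of sorted string identifiers described above can be sketched in a few lines (the function names are illustrative):

```python
# Minimal sketch of delta coding for a sorted list of natural-number
# string identifiers: store the first identifier, then each gap to the
# previous identifier. Sorted input guarantees small, non-negative gaps,
# which compress well with variable-length integer codes.

def delta_encode(ids):
    """Encode a sorted list of identifiers as gaps."""
    out, prev = [], 0
    for i in ids:
        out.append(i - prev)
        prev = i
    return out

def delta_decode(deltas):
    """Recover the original identifiers by prefix-summing the gaps."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

Real-valued norms, by contrast, have no such small-gap structure, which is why this scheme does not carry over directly to them.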