have diverged far enough from the true values such that the guarantees
no longer hold, the pending updates are applied in batch.
The propagation algorithm essentially computes lower and upper
bounds between which the weight of individual tokens can vary, such
that a selection query with a well-defined reduced threshold θ′ will
return all true answers (i.e., will not result in any false negatives). The
additional cost is an increased number of false positives. The reduced
threshold θ′ is a function of θ, the particular weighting scheme, and
the relaxation bounds. The tighter the chosen relaxation bounds, the
smaller the number of false positives.
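As a concrete illustration, the interplay between the relaxation bounds and the reduced threshold can be sketched as follows. The function name, the single aggregate `slack` bound, and the additive error model are illustrative assumptions for this sketch, not the survey's actual propagation algorithm:

```python
# Hypothetical sketch: selection with a reduced threshold so that stale
# token weights cannot cause false negatives. Assumption: the per-token
# relaxation bounds translate into a worst-case absolute error `slack`
# on the approximate score of any candidate string.

def select_with_relaxation(approx_scores, theta, slack):
    """Return every string whose approximate score clears the reduced
    threshold theta' = theta - slack.

    If each true score differs from its approximate score by at most
    `slack`, then any true answer (true score >= theta) has an
    approximate score >= theta - slack and is never pruned. Strings in
    the result whose true score is below theta are the false positives
    that a verification step must later remove.
    """
    theta_prime = theta - slack
    return {s for s, score in approx_scores.items() if score >= theta_prime}
```

Note how the tighter the relaxation bounds (the smaller `slack`), the closer θ′ is to θ and the fewer false positives survive the filter.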
6.8 Discussion and Related Issues
It is very difficult to determine whether any of the aforementioned
strategies performs better than the rest across the board. The relative
cost savings between sorting according to string identifiers, sorting
according to L_p-norms, or prioritizing lists depend heavily on a variety
of factors, the most important of which are: the algorithm used, the
weighting scheme used, the specific query (whether it contains tokens
with very long lists, for which L_p-norm filtering will yield significant
pruning), the overall token distribution of the data strings, the storage
medium used to store the lists (e.g., disk drive, main memory, solid-state
drive), and the type of compression used for the lists, if any.
Choosing the right algorithm depends on specific application and data
characteristics.
It is important to note here that most list compression algorithms
are designed for compressing string identifiers, which are natural
numbers. The idea is that natural numbers in sorted order can easily
be represented by delta coding: only the first number is stored, and
each consecutive number is represented by its distance from the
previous number. This representation decreases the magnitude of the
stored numbers (since deltas are expected to be small when the numbers
are in sorted order) and enables very efficient compression. A significant
problem with all normalized similarity functions is that, given arbitrary
token weights, token lists have to store real-valued L_1-norms (or
L_2-norms for Cosine similarity). Lists containing
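The delta coding of sorted string identifiers described above can be sketched in a few lines (the function names are illustrative):

```python
# Minimal sketch of delta coding for a sorted list of natural-number
# string identifiers: store the first identifier, then each gap to the
# previous identifier. Sorted input guarantees small, non-negative gaps,
# which compress well with variable-length integer codes.

def delta_encode(ids):
    """Encode a sorted list of identifiers as gaps."""
    out, prev = [], 0
    for i in ids:
        out.append(i - prev)
        prev = i
    return out

def delta_decode(deltas):
    """Recover the original identifiers by prefix-summing the gaps."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

Real-valued norms, by contrast, have no such small-gap structure, which is why this scheme does not carry over directly to them.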