Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Database Reference

In-Depth Information

intersection similarity I ( r, C ). An interesting observation here is that

we can use the partially built compressed inverted index to identify

which existing cluster is the most similar to r . This is accomplished

by simply running a top- k selection query on the compressed inverted

index with query string r . In this case the top- k algorithm needs to

identify a cluster that contains fewer than A strings. We can identify

the top- k matches incrementally until a suitable cluster C is found.

In addition, if

<N a new

cluster is created instead. Threshold φ can be computed as a function

of the average number of strings per cluster and the average number

of tokens per string in S

I

( r, C ) <φ , for a given threshold φ and

|C|

R . Otherwise, r is assigned to cluster C .

A pointer to C is inserted in all the inverted lists of the compressed

inverted index corresponding to tokens in r that are not already con-

tained in Λ C . The algorithm also stores one record ( r,C ) per string

r in a hash table, which is needed in the querying phase. Finally, the

algorithm scans all strings s

∪

S and for each s queries the compressed

inverted index to find all qualifying clusters C such that

∈

θ .

These are the clusters containing strings that need to be joined with s .

The algorithm stores one record ( C, s ) per qualifying cluster in a hash

table. A detail here is that instead of computing the weighted intersec-

tion between a string and the clusters, we can use Jaccard similarity

without affecting the correctness of the algorithm. Jaccard similarity

will prevent large clusters from getting too large too fast.

In the second phase the algorithm creates one fine-grained inverted

index per cluster and computes the join results of each string s

I

( s, C )

≥

S

that needs to join with this cluster, by using the string/cluster and

qualifying-cluster/string hash tables built in phase one. The com-

pressed inverted index can be discarded at this stage. The algorithm

uses the available memory M to build an inverted index for a set of

clusters C ∈C that fits in main memory. Each set of clusters is treated

as one join partition.

Notice that the algorithm described above cannot be extended for

Jaccard, Dice, or Cosine similarity. The main reason is that the idea is

based on a monotonicity property: If

∈

I

( v,C ) <θ then there does not

exist s

θ . It is easy to see that this property is not

true for Jaccard, Dice, or Cosine similarity.

∈

C s.t.

I

( v,s )

≥

Approximate String Processing

Search WWH ::

Custom Search

Home