Database Reference
In-Depth Information
intersection similarity I ( r, C ). An interesting observation here is that
we can use the partially built compressed inverted index to identify
which existing cluster is the most similar to r . This is accomplished
by simply running a top- k selection query on the compressed inverted
index with query string r . In this case the top- k algorithm needs to
identify a cluster that contains fewer than A strings. We can identify
the top- k matches incrementally until a suitable cluster C is found.
In addition, if
<N a new
cluster is created instead. Threshold φ can be computed as a function
of the average number of strings per cluster and the average number
of tokens per string in S
I
( r, C ) , for a given threshold φ and
|C|
R . Otherwise, r is assigned to cluster C .
A pointer to C is inserted in all the inverted lists of the compressed
inverted index corresponding to tokens in r that are not already con-
tained in Λ C . The algorithm also stores one record ( r,C ) per string
r in a hash table, which is needed in the querying phase. Finally, the
algorithm scans all strings s
S and for each s queries the compressed
inverted index to find all qualifying clusters C such that
θ .
These are the clusters containing strings that need to be joined with s .
The algorithm stores one record ( C, s ) per qualifying cluster in a hash
table. A detail here is that instead of computing the weighted intersec-
tion between a string and the clusters, we can use Jaccard similarity
without affecting the correctness of the algorithm. Jaccard similarity
will prevent large clusters from getting too large too fast.
In the second phase the algorithm creates one fine-grained inverted
index per cluster and computes the join results of each string s
I
( s, C )
S
that needs to join with this cluster, by using the string/cluster and
qualifying-cluster/string hash tables built in phase one. The com-
pressed inverted index can be discarded at this stage. The algorithm
uses the available memory M to build an inverted index for a set of
clusters C ∈C that fits in main memory. Each set of clusters is treated
as one join partition.
Notice that the algorithm described above cannot be extended for
Jaccard, Dice, or Cosine similarity. The main reason is that the idea is
based on a monotonicity property: If
I
( v,C ) then there does not
exist s
θ . It is easy to see that this property is not
true for Jaccard, Dice, or Cosine similarity.
C s.t.
I
( v,s )
Search WWH ::




Custom Search