datasets to be joined, and let the inverted indexes tailored for all-match selection queries for weighted intersection similarity for S and R have sizes W_s and W_r, respectively. Assume that the largest inverted index that can fit in main memory has size M, where M < W_s < W_r. The first phase builds a compressed inverted index for R that can fit in main memory. The compressed inverted index is used to identify qualifying clusters for each string in S. Subsequently, a fine-grained inverted index is built on the qualifying clusters and used to answer the join query.
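The two phases can be sketched as follows. This is a minimal illustration, not the book's implementation: strings are assumed to be represented as sets of weighted tokens, all names are illustrative, and the per-cluster fine-grained inverted index is simplified to direct verification of each cluster member.

```python
from collections import defaultdict

def weighted_intersection(a, b, weight):
    """I(a, b): total weight of the tokens shared by token sets a and b."""
    return sum(weight[t] for t in a & b)

def cluster_join(S, R_clusters, weight, theta):
    """Two-phase join sketch. R_clusters partitions R into lists of
    token sets (the clusters); weight maps each token to its weight."""
    # Phase 1: compressed inverted index -- one list entry per cluster,
    # keyed by every token in the cluster's token union Lambda_C.
    lam = [set().union(*c) for c in R_clusters]   # Lambda_C per cluster
    compressed = defaultdict(list)
    for cid, tokens in enumerate(lam):
        for t in tokens:
            compressed[t].append(cid)

    results = []
    for s in S:
        # Qualifying clusters: weighted intersection with the query
        # string meets the threshold theta.
        candidates = {cid for t in s for cid in compressed.get(t, ())}
        qualifying = [cid for cid in candidates
                      if weighted_intersection(s, lam[cid], weight) >= theta]
        # Phase 2: fine-grained check restricted to qualifying clusters.
        for cid in qualifying:
            for r in R_clusters[cid]:
                if weighted_intersection(s, r, weight) >= theta:
                    results.append((s, r))
    return results
```

Because inverted lists point to clusters rather than individual strings, the lists in `compressed` are shorter than in a full index; only members of qualifying clusters are ever verified.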
The compressed inverted index is built by grouping together strings
with several overlapping tokens into clusters. The idea is to use pointers
to the clusters of strings in the inverted index (instead of pointers to the
strings themselves), thereby decreasing the length of the inverted lists.
Let C be a cluster of strings. The cluster is represented by the union of tokens in all strings s ∈ C. Let Λ_C denote the union of tokens of cluster C. All inverted lists of the compressed inverted index corresponding to tokens in Λ_C contain a pointer to C. To answer a query, first the compressed inverted index is scanned as usual to find all qualifying clusters; a qualifying cluster needs to have weighted intersection similarity with the query string larger than the threshold θ (i.e., I(v, C) ≥ θ). Then a fine-grained query is executed on the strings of each qualifying cluster.

Initially, we need to build a good set of string clusters 𝒞 such that each cluster in 𝒞 contains similar strings, and all clusters have approximately equal sizes. The algorithm first determines the maximum number of clusters and the average number of strings per cluster that the compressed inverted index can hold in main memory. The algorithm estimates the number of clusters as N = |R| · M / W_r, and the average number of strings per cluster as A = |R| / N, under the assumption that M < W_r (if this assumption is violated, the algorithm can do recursive partitioning).
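Taking the estimates N = |R| · M / W_r and A = |R| / N with some illustrative numbers (the values below are assumptions, not from the text), the arithmetic works out as:

```python
# Illustrative numbers: |R| strings, full inverted index of size W_r,
# and a main-memory budget M (sizes in the same unit, e.g., MB).
R_size = 1_000_000          # |R|
W_r = 4_000                 # size of the full inverted index for R
M = 1_000                   # largest index that fits in main memory

N = R_size * M // W_r       # estimated number of clusters
A = R_size // N             # average strings per cluster, |R| / N

print(N, A)                 # 250000 clusters, 4 strings per cluster
```

Intuitively, the compressed index shrinks by the same factor W_r / M by which memory falls short, so each cluster absorbs W_r / M = 4 strings on average.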
Next, the clusters are constructed as follows. The algorithm performs a linear scan of R, and for each string r ∈ R, it chooses a cluster C, which can be either an existing cluster with enough overlap with r, or a new cluster, provided that the number of clusters created so far is smaller than N (i.e., |𝒞| < N). Finding a cluster for each string r is based on a similarity function between r and that cluster. A simple definition for the similarity of string r and cluster C is their weighted intersection similarity.
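A one-pass construction along these lines might look as follows. The overlap threshold `tau` and the fallback used once all N clusters exist are assumptions the text leaves open; the similarity of a string to a cluster is its weighted intersection with the cluster's token union Λ_C.

```python
def build_clusters(R, weight, N, tau):
    """Greedy linear-scan clustering sketch. R is a list of token sets;
    weight maps token -> weight; N caps the number of clusters; tau is
    an assumed minimum overlap for joining an existing cluster."""
    clusters, lams = [], []            # member lists and Lambda_C unions
    for r in R:
        # Score r against every existing cluster's token union.
        best, best_sim = None, 0.0
        for cid, lam in enumerate(lams):
            sim = sum(weight[t] for t in r & lam)   # I(r, Lambda_C)
            if sim > best_sim:
                best, best_sim = cid, sim
        if best is not None and best_sim >= tau:
            clusters[best].append(r)   # enough overlap: join best cluster
            lams[best] |= r
        elif len(clusters) < N:
            clusters.append([r])       # open a new cluster while |C| < N
            lams.append(set(r))
        else:
            # No room for a new cluster: fall back to the closest one.
            target = best if best is not None else 0
            clusters[target].append(r)
            lams[target] |= r
    return clusters
```

Scanning against every cluster keeps the sketch short; a practical version would index the cluster token unions to find the best cluster faster.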