Databases Reference
In-Depth Information
of the string times the maximum Jaccard distance (1 minus the minimum
Jaccard similarity).
F
Position Indexes: We can index strings not only on the characters in
their prefixes, but on the position of that character within the prefix. We
reduce the number of pairs of strings that must be compared, because
if two strings share a character that is not in the first position in both
strings, then we know that either there are some preceding characters that
are in the union but not the intersection, or there is an earlier symbol that
appears in both strings.
F
Su x Indexes: We can also index strings based not only on the characters
in their prefixes and the positions of those characters, but on the length
of the character's su x - the number of positions that follow it in the
string. This structure further reduces the number of pairs that must be
compared, because a common symbol with different su x lengths implies
additional characters that must be in the union but not in the intersection.
3.11
References for Chapter 3
The technique we called shingling is attributed to [10]. The use in the manner
we discussed here is from [2]. Minhashing comes from [3]. The original works
on locality-sensitive hashing were [9] and [7]. [1] is a useful summary of ideas
in this field.
[4] introduces the idea of using random-hyperplanes to summarize items in
a way that respects the cosine distance. [8] suggests that random hyperplanes
plus LSH can be more accurate at detecting similar documents than minhashing
plus LSH.
Techniques for summarizing points in a Euclidean space are covered in [6].
[11] presented the shingling technique based on stop words.
The length and prefix-based indexing schemes for high-similarity matching
comes from [5]. The technique involving su x length is from [12].
1. A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approxi-
mate nearest neighbor in high dimensions,” Comm. ACM 51:1, pp. 117-
122, 2008.
2. A.Z. Broder, “On the resemblance and containment of documents,” Proc.
Compression and Complexity of Sequences, pp. 21-29, Positano Italy,
1997.
3. A.Z. Broder, M. Charikar, A.M. Frieze, and M. Mitzenmacher, “Min-wise
independent permutations,” ACM Symposium on Theory of Computing,
pp. 327-336, 1998.
4. M.S. Charikar, “Similarity estimation techniques from rounding algo-
rithms,” ACM Symposium on Theory of Computing, pp. 380-388, 2002.
Search WWH ::




Custom Search