Databases Reference
In-Depth Information
into one of set intersection, we use a technique called “shingling,” which is
introduced in Section 3.2.
3.1.1
Jaccard Similarity of Sets
The Jaccard similarity of sets S and T is|S∩T|/|S∪T|, that is, the ratio
of the size of the intersection of S and T to the size of their union. We shall
denote the Jaccard similarity of S and T by SIM (S, T ).
Example 3.1 : In Fig. 3.1 we see two sets S and T . There are three elements
in their intersection and a total of eight elements that appear in S or T or both.
Thus, SIM (S, T ) = 3/8.
2
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
S
T
Figure 3.1: Two sets with Jaccard similarity 3/8
3.1.2
Similarity of Documents
An important class of problems that Jaccard similarity addresses well is that
of finding textually similar documents in a large corpus such as the Web or a
collection of news articles. We should understand that the aspect of similarity
we are looking at here is character-level similarity, not “similar meaning,” which
requires us to examine the words in the documents and their uses. That problem
is also interesting but is addressed by other techniques, which we hinted at in
Section 1.3.1. However, textual similarity also has important uses. Many of
these involve finding duplicates or near duplicates. First, let us observe that
testing whether two documents are exact duplicates is easy; just compare the
two documents character-by-character, and if they ever differ then they are not
the same. However, in many applications, the documents are not identical, yet
they share large portions of their text. Here are some examples:
Search WWH ::




Custom Search