Database Reference
In-Depth Information
finding similar customers and similar products. In order to turn the problem of textual sim-
ilarity of documents into one of set intersection, we use a technique called “shingling,”
which is introduced in Section 3.2 .
3.1.1
Jaccard Similarity of Sets
The Jaccard similarity of sets S and T is | S T |/| S T |, that is, the ratio of the size of the
intersection of S and T to the size of their union. We shall denote the Jaccard similarity of
S and T by SIM( S, T ).
EXAMPLE 3.1 In Fig. 3.1 we see two sets S and T . There are three elements in their in-
tersection and a total of eight elements that appear in S or T or both. Thus, SIM( S, T ) =
3/8.
Figure 3.1 Two sets with Jaccard similarity 3/8
3.1.2
Similarity of Documents
An important class of problems that Jaccard similarity addresses well is that of finding tex-
tually similar documents in a large corpus such as the Web or a collection of news articles.
We should understand that the aspect of similarity we are looking at here is character-level
similarity, not “similar meaning,” which requires us to examine the words in the documents
and their uses. That problem is also interesting but is addressed by other techniques, which
we hinted at in Section 1.3.1 . However, textual similarity also has important uses. Many of
these involve finding duplicates or near duplicates. First, let us observe that testing whether
two documents are exact duplicates is easy; just compare the two documents character-by-
character, and if they ever differ then they are not the same. However, in many applications,
the documents are not identical, yet they share large portions of their text. Here are some
examples:
Search WWH ::




Custom Search