Finding Similar Items - Mining of Massive Datasets

Plagiarism

Finding plagiarized documents tests our ability to find textual similarity. The plagiarizer may extract only some parts of a document for his own use, alter a few words, and change the order in which the sentences of the original appear. Yet the resulting document may still contain 50% or more of the original. No simple process of comparing documents character by character will detect sophisticated plagiarism.
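
To make the last point concrete, the following is a minimal sketch, not a method taken from the text. It compares two short passages, one a reordered and lightly reworded copy of the other, first position by position and then as sets of words using Jaccard similarity. The function names, the sample sentences, and the choice of whitespace-separated words as set elements are all assumptions made for illustration.

```python
def positional_match(a, b):
    """Fraction of character positions at which the two strings agree."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / n


def word_set(text):
    """Set of lowercase whitespace-separated tokens (an illustrative choice)."""
    return set(text.lower().split())


def jaccard(s, t):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not s and not t:
        return 1.0
    return len(s & t) / len(s | t)


# Hypothetical sample texts: the copy swaps the sentence order and one word.
original = "the quick brown fox jumps over the lazy dog. a storm is coming tonight."
copied = "a storm is coming tonight. the quick brown fox leaps over the lazy dog."

char_score = positional_match(original, copied)
set_score = jaccard(word_set(original), word_set(copied))

print(f"character-by-character agreement: {char_score:.2f}")
print(f"Jaccard similarity of word sets:  {set_score:.2f}")
```

Under these assumptions the position-by-position score is close to zero, because reordering the sentences shifts every character, while the set-based score stays high, because the two passages share almost all of their words.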

Mirror Pages

It is common for important or popular Web sites to be duplicated at a number of hosts, in order to share the load. The pages of these mirror sites will be quite similar, but are rarely identical. For instance, they might each contain information associated with their particular host, and they might each have links to the other mirror sites but not to themselves. A related phenomenon is the appropriation of pages from one class to another. These pages might include class notes, assignments, and lecture slides. Similar pages might change the name of the course and the year, and make small changes from year to year. It is important to be able to detect similar pages of these kinds, because search engines produce better results if they avoid showing two pages that are nearly identical within the first page of results.

Articles from the Same Source

It is common for one reporter to write a news article that gets distributed, say through the Associated Press, to many newspapers, which then publish the article on their Web sites. Each newspaper changes the article somewhat. They may cut out paragraphs, or even add material of their own. They most likely will surround the article with their own logo, ads, and links to other articles at their site. However, the core of each newspaper's page will be the original article. News aggregators, such as Google News, try to find all versions of such an article, in order to show only one, and that task requires finding when two Web pages are textually similar, although not identical.

3.1.3 Collaborative Filtering as a Similar-Sets Problem

Another class of applications where similarity of sets is very important is called collaborative filtering, a process whereby we recommend to users items that were liked by other users who have exhibited similar tastes. We shall investigate collaborative filtering in detail in Section 9.3, but for the moment let us see some common examples.
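
Before turning to those examples, here is a small sketch of the set-based view, under stated assumptions: each user is represented by the set of items he or she liked, "similar tastes" is measured by Jaccard similarity of those sets, and we recommend what the single most similar user liked. The user names, item names, and the `recommend` helper are hypothetical, and real systems (see Section 9.3) are considerably more elaborate.

```python
def jaccard(s, t):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not s and not t:
        return 1.0
    return len(s & t) / len(s | t)


# Hypothetical data: each user maps to the set of items that user liked.
likes = {
    "ann": {"book_a", "book_b", "book_c"},
    "bob": {"book_a", "book_b", "book_d"},
    "cat": {"book_e"},
}


def recommend(user, likes):
    """Return the most similar other user and that user's items new to `user`."""
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return nearest, sorted(likes[nearest] - likes[user])


nearest, suggestions = recommend("ann", likes)
print(nearest, suggestions)
```

Here "ann" and "bob" share two of the four distinct items they liked between them, so "bob" is the nearest user and the one item he liked that "ann" has not seen becomes the suggestion. Using only the single nearest user is a simplification chosen to keep the sketch short.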
