Plagiarism
Finding plagiarized documents tests our ability to find textual similarity. The plagiarizer
may extract only some parts of a document for his own use. He may alter a few words and may
alter the order in which sentences of the original appear. Yet the resulting document may
still contain 50% or more of the original. No simple process of comparing documents
character by character will detect sophisticated plagiarism.
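To illustrate why exact comparison fails while set-based comparison still works, here is a minimal sketch. It assumes two techniques not defined in this passage: representing each document as a set of k-word runs ("shingles") and scoring overlap with Jaccard similarity. The function names, the sample texts, and the choice k = 3 are all hypothetical.

```python
def word_shingles(text, k=3):
    """Return the set of k-word shingles (runs of k consecutive words)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; taken as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical original document (punctuation kept as separate tokens).
original = ("the quick brown fox jumps over the lazy dog . "
            "a stitch in time saves nine .")
# Hypothetical copy: sentences reordered and one word altered.
copied = ("a stitch in time saves nine . "
          "the quick brown fox leaps over the lazy dog .")

exact_match = original == copied  # character-by-character comparison
similarity = jaccard(word_shingles(original), word_shingles(copied))
```

Here the exact comparison reports no match, while the shingle sets still share a large fraction of their elements, which is the kind of signal a plagiarism detector needs.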
Mirror Pages
It is common for important or popular Web sites to be duplicated at a number of hosts, in
order to share the load. The pages of these mirror sites will be quite similar, but are rarely
identical. For instance, they might each contain information associated with their particular
host, and they might each have links to the other mirror sites but not to themselves. A
related phenomenon is the appropriation of pages from one class to another. These pages
might include class notes, assignments, and lecture slides. Similar pages might change the
name of the course and the year, and make small changes from year to year. It is important to be
able to detect similar pages of these kinds, because search engines produce better results if
they avoid showing two pages that are nearly identical within the first page of results.
Articles from the Same Source
It is common for one reporter to write a news article that gets distributed, say through the
Associated Press, to many newspapers, which then publish the article on their Web sites.
Each newspaper changes the article somewhat. They may cut out paragraphs, or even add
material of their own. They most likely will surround the article with their own logo, ads, and
links to other articles at their site. However, the core of each newspaper's page will be the
original article. News aggregators, such as Google News, try to find all versions of such an
article, in order to show only one, and that task requires finding when two Web pages are
textually similar, although not identical.
3.1.3 Collaborative Filtering as a Similar-Sets Problem
Another class of applications where similarity of sets is very important is called collaborative
filtering, a process whereby we recommend to users items that were liked by other
users who have exhibited similar tastes. We shall investigate collaborative filtering in detail
in Section 9.3, but for the moment let us see some common examples.
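As a rough sketch of the similar-sets view, each user can be represented by the set of items he liked, and the user with the most similar set supplies recommendations. The data, the function names, and the use of Jaccard similarity as the measure are assumptions made for illustration, not a method taken from this passage.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; taken as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical data: each user mapped to the set of items that user liked.
liked = {
    "ann": {"book_a", "book_b", "book_c"},
    "bob": {"book_a", "book_b", "book_d"},
    "cat": {"book_e"},
}


def recommend(user, liked):
    """Return the most similar other user and that user's items `user` lacks."""
    others = [u for u in liked if u != user]
    nearest = max(others, key=lambda u: jaccard(liked[user], liked[u]))
    return nearest, liked[nearest] - liked[user]


nearest, suggestions = recommend("ann", liked)
```

With this data, "ann" shares two of four distinct items with "bob" and none with "cat", so the suggestions come from the items "bob" liked that "ann" has not.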