Recommendation Systems - Mining of Massive Datasets

does not tell us anything useful about their features. We can calculate simple properties of

pixels, such as the average amount of red in the picture, but few users are looking for red

pictures or especially like red pictures.

There have been a number of attempts to obtain information about features of items by

inviting users to tag the items by entering words or phrases that describe the item. Thus,

one picture with a lot of red might be tagged “Tiananmen Square,” while another is tagged

“sunset at Malibu.” The distinction is not something that could be discovered by existing

image-analysis programs.

Two Kinds of Document Similarity

Recall that in Section 3.4 we gave a method for finding documents that were “similar,” using shingling, minhashing,

and LSH. There, the notion of similarity was lexical: documents are similar if they contain large, identical sequences

of characters. For recommendation systems, the notion of similarity is different. We are interested only in the

occurrences of many important words in both documents, even if there is little lexical similarity between the documents.

However, the methodology for finding similar documents remains almost the same. Once we have a distance measure,

either Jaccard or cosine, we can use minhashing (for Jaccard) or random hyperplanes (for cosine distance; see Section

3.7.2), feeding data to an LSH algorithm to find the pairs of documents that are similar in the sense of sharing many

common keywords.
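
To make the two distance measures concrete, here is a minimal Python sketch that treats each document as a set of important keywords. The function names, the sample keyword sets, and the choice of salted BLAKE2b hashes to stand in for random hash functions are all assumptions made for illustration; the text does not prescribe them. The sketch computes Jaccard similarity, the cosine angle between the corresponding 0/1 vectors, and a minhash signature whose agreement rate estimates the Jaccard similarity.

```python
import hashlib
import math


def jaccard_similarity(a, b):
    """|a AND b| / |a OR b| for two keyword sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine_angle_degrees(a, b):
    """Angle between the 0/1 vectors of two non-empty keyword sets."""
    if not a or not b:
        raise ValueError("angle is undefined for an empty keyword set")
    cos = len(a & b) / math.sqrt(len(a) * len(b))
    return math.degrees(math.acos(min(1.0, cos)))


def _salted_hash(word, seed):
    # Hypothetical stand-in for the seed-th random hash function.
    data = f"{seed}:{word}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(keywords, num_hashes=128):
    """One minimum hash value per hash function; keywords must be non-empty."""
    return [min(_salted_hash(w, seed) for w in keywords)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc1 = {"fed", "rates", "inflation", "economy"}
doc2 = {"inflation", "economy", "prices", "fed"}

print(jaccard_similarity(doc1, doc2))              # 3 shared of 5 total: 0.6
print(round(cosine_angle_degrees(doc1, doc2), 2))  # cosine is 3/4
print(estimate_jaccard(minhash_signature(doc1), minhash_signature(doc2)))
```

The estimate printed on the last line should fall near 0.6 but will vary with the number of hash functions. In a full pipeline the signatures, rather than the raw keyword sets, would be passed to LSH.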

Tags from Computer Games

An interesting direction for encouraging tagging is the “games” approach pioneered by Luis von Ahn. He enabled two

players to collaborate on the tag for an image. In each round, both players would suggest a tag, and the tags would be exchanged.

If they agreed, then they “won,” and if not, they would play another round with the same image, trying to agree

simultaneously on a tag. Although this is an innovative direction, it is questionable whether sufficient public interest can be

generated to produce enough free work to satisfy the needs for tagged data.

Almost any kind of data can have its features described by tags. One of the earliest

attempts to tag massive amounts of data was the site del.icio.us, later bought by Yahoo!,

which invited users to tag Web pages. The goal of this tagging was to make a new method

of search available, where users entered a set of tags as their search query, and the system

retrieved the Web pages that had been tagged that way. However, it is also possible to use

the tags as a recommendation system. If it is observed that a user retrieves or bookmarks

many pages with a certain set of tags, then we can recommend other pages with the same

tags.
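
The two uses of tags described above, search and recommendation, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample pages, and the scoring rule (sum, over a page's tags, of how often each tag appears among the user's bookmarks) are hypothetical and are not taken from del.icio.us or from the text.

```python
from collections import Counter


def search_by_tags(page_tags, query):
    """Return pages whose tag set contains every tag in the query."""
    query = set(query)
    return sorted(page for page, tags in page_tags.items() if query <= tags)


def recommend_by_tags(page_tags, bookmarked, top_n=3):
    """Recommend unbookmarked pages that share tags with the bookmarks."""
    # Build a tag profile: how many bookmarked pages carry each tag.
    profile = Counter()
    for page in bookmarked:
        profile.update(page_tags.get(page, ()))

    # Score every other page by the profile weight of its tags.
    scores = {}
    for page, tags in page_tags.items():
        if page in bookmarked:
            continue
        score = sum(profile[t] for t in tags)
        if score > 0:
            scores[page] = score

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]


# Hypothetical tagged pages.
page_tags = {
    "a.example/python-intro": {"python", "tutorial"},
    "b.example/pandas": {"python", "data"},
    "c.example/numpy": {"python", "data", "tutorial"},
    "d.example/gardening": {"garden", "plants"},
    "e.example/sql": {"data", "database"},
}
bookmarked = {"a.example/python-intro", "b.example/pandas"}

print(search_by_tags(page_tags, {"python", "data"}))
# ['b.example/pandas', 'c.example/numpy']
print(recommend_by_tags(page_tags, bookmarked))
# ['c.example/numpy', 'e.example/sql']
```

Weighting by tag frequency is one simple choice; the Jaccard or cosine measures from the sidebar above could be substituted to compare a page's tags with the user's profile.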

The problem with tagging as an approach to feature discovery is that the process

works only if users are willing to take the trouble to create the tags, and if there are enough tags

that occasional erroneous ones will not bias the system too much.
