we suggest pages a user might want to see? Likewise, blogs could be recommended to interested users, if we could classify blogs by topic.
Unfortunately, these classes of documents do not tend to have readily available information that gives us features. A substitute that has been useful in practice is the identification of words that characterize the topic of a document. How to do the identification was outlined in Section 1.3.1. First, eliminate stop words, the several hundred most common words,
which tend to say little about the topic of a document. For the remaining words, compute
the TF.IDF score for each word in the document. The ones with the highest scores are the
words that characterize the document.
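The procedure above can be sketched in Python. This is a minimal illustration, not the book's code: the function name is ours, and we use the normalized term-frequency convention (count divided by the count of the most frequent word in the document) together with a base-2 logarithm for IDF, as in Section 1.3.1.

```python
import math
from collections import Counter

def tf_idf_scores(doc_words, corpus, stop_words):
    """Score each non-stop word of one document by TF.IDF.

    doc_words: list of words in the document of interest.
    corpus: list of documents (each a list of words), including this one.
    stop_words: set of common words to eliminate first.
    """
    words = [w for w in doc_words if w not in stop_words]
    counts = Counter(words)
    max_count = max(counts.values())  # frequency of the most common remaining word
    n_docs = len(corpus)
    scores = {}
    for w, c in counts.items():
        tf = c / max_count                     # normalized term frequency of w
        df = sum(1 for d in corpus if w in d)  # number of documents containing w
        idf = math.log2(n_docs / df)           # inverse document frequency of w
        scores[w] = tf * idf
    return scores
```

A word appearing often in this document but in few others gets a high score, which is exactly the behavior we want from a topic-characterizing word.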
We may then take as the features of a document the n words with the highest TF.IDF
scores. It is possible to pick n to be the same for all documents, or to let n be a fixed percentage of the words in the document. We could also choose to include in the feature set all words whose TF.IDF scores are above a given threshold.
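Both selection rules are easy to state as code. The following sketch (function names are ours) takes the score dictionary produced for a document and returns the feature set under either rule:

```python
def top_n_features(scores, n):
    """Take as features the n words with the highest TF.IDF scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])

def threshold_features(scores, threshold):
    """Alternatively, take every word whose TF.IDF score exceeds a threshold."""
    return {w for w, s in scores.items() if s > threshold}
```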
Now, documents are represented by sets of words. Intuitively, we expect these words
to express the subjects or main ideas of the document. For example, in a news article,
we would expect the words with the highest TF.IDF score to include the names of people
discussed in the article, unusual properties of the event described, and the location of the
event. To measure the similarity of two documents, there are several natural distance measures we can use:
(1) We could use the Jaccard distance between the sets of words (recall Section 3.5.3).
(2) We could use the cosine distance (recall Section 3.5.4) between the sets, treated as vectors.
To compute the cosine distance in option (2), think of the sets of high-TF.IDF words as
a vector, with one component for each possible word. The vector has 1 if the word is in
the set and 0 if not. Since between two documents there are only a finite number of words
among their two sets, the infinite dimensionality of the vectors is unimportant. Almost all
components are 0 in both, and 0s do not impact the value of the dot product. To be precise,
the dot product is the size of the intersection of the two sets of words, and the lengths of
the vectors are the square roots of the numbers of words in each set. That calculation lets
us compute the cosine of the angle between the vectors as the dot product divided by the
product of the vector lengths.
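Because only the intersection and the set sizes matter, both distances reduce to a few set operations. A minimal sketch, assuming the documents are already represented as Python sets of their high-TF.IDF words (and taking the cosine distance to be the angle between the 0/1 vectors, as in Section 3.5.4):

```python
import math

def jaccard_distance(s, t):
    """Jaccard distance: 1 minus the ratio of intersection size to union size."""
    return 1 - len(s & t) / len(s | t)

def cosine_distance(s, t):
    """Angle between the sets viewed as 0/1 vectors.

    The dot product is the size of the intersection; each vector's
    length is the square root of the size of its set.
    """
    cos = len(s & t) / (math.sqrt(len(s)) * math.sqrt(len(t)))
    return math.acos(cos)  # angle in radians
```

For example, the sets {a, b} and {b, c} share one of three distinct words, giving Jaccard distance 2/3, and the cosine of the angle between their vectors is 1/2, giving an angle of 60 degrees.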
9.2.3 Obtaining Item Features From Tags
Let us consider a database of images as an example of a way that features have been obtained for items. The problem with images is that their data, typically an array of pixels,