we suggest pages a user might want to see? Likewise, blogs could be recommended to interested users, if we could classify blogs by topic.
Unfortunately, these classes of documents do not tend to have readily available information that gives us features. A substitute that has been useful in practice is the identification of words that characterize the topic of a document. How to do the identification was outlined in Section 1.3.1. First, eliminate stop words, the several hundred most common words,
which tend to say little about the topic of a document. For the remaining words, compute
the TF.IDF score for each word in the document. The ones with the highest scores are the
words that characterize the document.
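The procedure above can be sketched in Python. This is a minimal illustration, not the book's code: the function name is ours, and we use the normalized term-frequency convention (count divided by the count of the most frequent word in the document) together with a base-2 logarithm for IDF, as in Section 1.3.1.

```python
import math
from collections import Counter

def tf_idf_scores(doc_words, corpus, stop_words):
    """Score each non-stop word of one document by TF.IDF.

    doc_words: list of words in the document of interest.
    corpus: list of documents (each a list of words), including this one.
    stop_words: set of common words to eliminate first.
    """
    words = [w for w in doc_words if w not in stop_words]
    counts = Counter(words)
    max_count = max(counts.values())  # frequency of the most common remaining word
    n_docs = len(corpus)
    scores = {}
    for w, c in counts.items():
        tf = c / max_count                     # normalized term frequency of w
        df = sum(1 for d in corpus if w in d)  # number of documents containing w
        idf = math.log2(n_docs / df)           # inverse document frequency of w
        scores[w] = tf * idf
    return scores
```

A word appearing often in this document but in few others gets a high score, which is exactly the behavior we want from a topic-characterizing word.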
We may then take as the features of a document the n words with the highest TF.IDF
scores. It is possible to pick n to be the same for all documents, or to let n be a fixed percentage of the words in the document. We could also choose to include in the feature set all words whose TF.IDF scores are above a given threshold.
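Both selection rules are easy to state as code. The following sketch (function names are ours) takes the score dictionary produced for a document and returns the feature set under either rule:

```python
def top_n_features(scores, n):
    """Take as features the n words with the highest TF.IDF scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])

def threshold_features(scores, threshold):
    """Alternatively, take every word whose TF.IDF score exceeds a threshold."""
    return {w for w, s in scores.items() if s > threshold}
```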
Now, documents are represented by sets of words. Intuitively, we expect these words
to express the subjects or main ideas of the document. For example, in a news article,
we would expect the words with the highest TF.IDF score to include the names of people
discussed in the article, unusual properties of the event described, and the location of the
event. To measure the similarity of two documents, there are several natural distance measures we can use:
(1) We could use the Jaccard distance between the sets of words (recall Section 3.5.3).
(2) We could use the cosine distance (recall Section 3.5.4) between the sets, treated as vectors.
To compute the cosine distance in option (2), think of the sets of high-TF.IDF words as
a vector, with one component for each possible word. The vector has 1 if the word is in
the set and 0 if not. Since between two documents there are only a finite number of words
among their two sets, the infinite dimensionality of the vectors is unimportant. Almost all
components are 0 in both, and 0s do not impact the value of the dot product. To be precise,
the dot product is the size of the intersection of the two sets of words, and the lengths of
the vectors are the square roots of the numbers of words in each set. That calculation lets
us compute the cosine of the angle between the vectors as the dot product divided by the
product of the vector lengths.
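Because only the intersection and the set sizes matter, both distances reduce to a few set operations. A minimal sketch, assuming the documents are already represented as Python sets of their high-TF.IDF words (and taking the cosine distance to be the angle between the 0/1 vectors, as in Section 3.5.4):

```python
import math

def jaccard_distance(s, t):
    """Jaccard distance: 1 minus the ratio of intersection size to union size."""
    return 1 - len(s & t) / len(s | t)

def cosine_distance(s, t):
    """Angle between the sets viewed as 0/1 vectors.

    The dot product is the size of the intersection; each vector's
    length is the square root of the size of its set.
    """
    cos = len(s & t) / (math.sqrt(len(s)) * math.sqrt(len(t)))
    return math.acos(cos)  # angle in radians
```

For example, the sets {a, b} and {b, c} share one of three distinct words, giving Jaccard distance 2/3, and the cosine of the angle between their vectors is 1/2, giving an angle of 60 degrees.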
9.2.3 Obtaining Item Features From Tags
Let us consider a database of images as an example of a way that features have been obtained for items. The problem with images is that their data, typically an array of pixels,