8.5.1 Set Difference
If each document is represented as a set of words, the set difference measure
can be used to measure the redundancy of a new document. The novelty of a
new document $d_t$ is measured by the number of new words in the smoothed
set representation of $d_t$. If a word $w_k$ occurred frequently in document
$d_t$ but less frequently in an old document $d_j$, it is likely that new
information not covered by $d_j$ is covered by $d_t$.
Thus we can have the following measure for the novelty of the current
document $d_t$ with respect to an old document $d_j$:

$$R(d_t \mid d_j) = \left\| d_t \cap \overline{d_j} \right\| \tag{8.7}$$

We are not using the true difference between two sets,
$\left\| d_t \cap \overline{d_j} \right\| + \left\| \overline{d_t} \cap d_j \right\|$,
here because the words in $\overline{d_t} \cap d_j$ shouldn't contribute to
the novelty of $d_t$.
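
As a minimal sketch, the one-sided measure can be computed with Python's built-in set operations. The function names and the example word sets below are hypothetical and chosen only for illustration; the sketch assumes each document has already been reduced to a set of words.

```python
def set_difference_novelty(new_doc_words, old_doc_words):
    """Number of words in the new document's set that are absent
    from the old document's set (the one-sided difference)."""
    return len(set(new_doc_words) - set(old_doc_words))


def symmetric_difference_size(a_words, b_words):
    """Size of the true (symmetric) set difference, shown for contrast."""
    return len(set(a_words) ^ set(b_words))


# Hypothetical word sets for a new document d_t and an old document d_j.
d_t = {"court", "ruling", "appeal", "verdict"}
d_j = {"court", "ruling", "judge"}

print(set_difference_novelty(d_t, d_j))     # 2: "appeal" and "verdict"
print(symmetric_difference_size(d_t, d_j))  # 3: also counts "judge"
```

The word "judge" appears only in the old document, so it raises the symmetric difference but not the one-sided score, which is the behavior the text argues for.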
Different variations of the set representation of a document have been
proposed. The simplest approach is to include a word in a set $d_j$ if and only
if the document contains the word. An alternative approach is to include
a word in a set representation if and only if the number of times the word
occurs in a document is larger than a threshold. However, some words are
expected to be frequent in a new document because they tend to be frequent
in the corpus, or because they tend to be frequent in all relevant documents.
Stop words such as “the,” “a,” and “and” are examples of words that tend
to be frequent in a corpus. There may also be topic-related stop words, which
are words that behave like stop words in relevant documents, even if they are
not stop words in the corpus as a whole. To compensate for stop words, a
third approach is to smooth a new document's word frequencies with word
counts from all previously seen documents and word counts from all delivered
(presumed relevant) documents (73).
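
The three set representations can be sketched as follows. The first two functions follow the text directly. The third is only one possible reading of the smoothing idea: the text does not give the smoothing formula, so the linear discount, the per-document rates, and the parameters `alpha`, `beta`, and `threshold` are assumptions made for illustration, as are all function names and example counts.

```python
from collections import Counter


def presence_set(tokens):
    """Approach 1: a word is in the set iff it occurs in the document."""
    return set(tokens)


def threshold_set(tokens, threshold=1):
    """Approach 2: a word is in the set iff its count exceeds a threshold."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count > threshold}


def smoothed_set(tokens, seen_counts, n_seen, delivered_counts, n_delivered,
                 threshold=1.0, alpha=1.0, beta=1.0):
    """Approach 3 (assumed form): discount each word's count by its average
    count per previously seen document and per delivered document, then
    apply the threshold. The linear discount is an assumption."""
    counts = Counter(tokens)
    selected = set()
    for word, tf in counts.items():
        seen_rate = seen_counts.get(word, 0) / max(n_seen, 1)
        delivered_rate = delivered_counts.get(word, 0) / max(n_delivered, 1)
        score = tf - alpha * seen_rate - beta * delivered_rate
        if score > threshold:
            selected.add(word)
    return selected


# Hypothetical document and background statistics.
tokens = "the court the ruling the appeal appeal appeal court".split()
seen_counts = Counter({"the": 300, "court": 20})      # over 100 seen documents
delivered_counts = Counter({"the": 30, "court": 15})  # over 10 delivered documents

print(sorted(presence_set(tokens)))
print(sorted(threshold_set(tokens, threshold=1)))
print(sorted(smoothed_set(tokens, seen_counts, 100, delivered_counts, 10)))
```

In this example the plain threshold keeps "the" and "court" because they are frequent in the document, while the smoothed variant drops them: "the" is frequent everywhere, and "court" behaves like a topic-related stop word in the delivered documents.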
8.5.2 Geometric Distance
If each document is represented as a vector, several different geometric
distance measures, such as Manhattan distance and cosine distance (31), can
be used to measure the redundancy of a new document.
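
A minimal sketch of the two measures named above, assuming each document is already a fixed-length list of term weights over the same vocabulary. The function names and example vectors are hypothetical. The second function returns the cosine of the angle between the vectors; this excerpt does not say whether "cosine distance" denotes that value or one minus it, so the sketch makes no claim either way.

```python
import math


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_of_angle(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term-count vectors over a four-word vocabulary.
d_new = [2, 1, 0, 1]
d_old = [1, 1, 1, 0]

print(manhattan_distance(d_new, d_old))         # 3
print(round(cosine_of_angle(d_new, d_old), 4))  # 0.7071
```

Both functions are symmetric in their arguments, unlike the one-sided set difference measure.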
For example, prior research shows that cosine distance, a symmetric
measure related to the angle between two vectors (26), works reasonably well for
redundancy detection. Represent $d$ as a vector $d = (w_1(d), w_2(d), \ldots, w_K(d))^T$,