When acquiring redundancy judgements and developing algorithms, we assume
that the redundancy of a new document $d_t$ depends on the documents the
user saw before $d_t$ arrived. We also assume that these documents are
exactly the set $D_t$ of all documents delivered to the user profile by the
time $d_t$ arrives. We use $R(d_t) = R(d_t \mid D_t)$ to measure the
redundancy of $d_t$.
One approach to novelty/redundancy detection is to cluster all previously
delivered documents $D_t$ and then measure the redundancy of the current
document $d_t$ by its distance to each cluster. This would be similar to
solutions for the TDT First Story Detection problem (2). However, it is
sensitive to clustering accuracy and rests on strong assumptions about the
nature of redundancy.
Another approach is to measure redundancy based on the distance between the
new document and each previously delivered document (document-document
distance). This approach was developed by researchers who argue that it may
be more robust than clustering and may better match how users view
redundancy: they found that it is easiest for a user to identify a new
document as redundant with a single previously seen document, and harder to
identify it as redundant with a set of previously seen documents. The
calculation of $R(d_t \mid D_t)$ is therefore simplified by setting it equal
to the largest of the pairwise values $R(d_t \mid d_j)$ over all $d_j \in D_t$:
$$R(d_t \mid D_t) = \max_{d_j \in D_t} R(d_t \mid d_j)$$
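The maximum over pairwise values can be sketched in a few lines of Python.
This is only an illustration: the function names `redundancy` and
`word_overlap` are hypothetical, the word-overlap score stands in for
whatever pairwise measure $R(d_t \mid d_j)$ is actually chosen, and
returning 0.0 for an empty history is an assumption rather than something
stated in the text.

```python
from typing import Callable, Iterable


def word_overlap(new_doc: str, old_doc: str) -> float:
    """Hypothetical pairwise measure: Jaccard overlap of the two word sets."""
    new_words = set(new_doc.lower().split())
    old_words = set(old_doc.lower().split())
    if not new_words and not old_words:
        return 1.0  # two empty documents are treated as duplicates
    return len(new_words & old_words) / len(new_words | old_words)


def redundancy(
    new_doc: str,
    delivered: Iterable[str],
    pairwise: Callable[[str, str], float] = word_overlap,
) -> float:
    """R(d_t | D_t): the largest pairwise score against any delivered document."""
    # Assumption: with nothing delivered yet, nothing can be redundant.
    return max((pairwise(new_doc, old) for old in delivered), default=0.0)


history = ["stocks fell sharply on monday", "the merger was approved today"]
print(redundancy("the merger was approved today", history))  # exact duplicate
```

Because only the single best match matters, one near-duplicate in the
history is enough to give the new document a high score, regardless of how
many unrelated documents surround it.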
In the extreme case where $d_t$ and $d_j$ are exact duplicates ($d_t = d_j$),
$R(d_t \mid d_j)$ should clearly have a high value, since a duplicate
document is maximally redundant. One natural way to measure
$R(d_t \mid d_j)$ is to use measures of similarity/distance/difference
between $d_t$ and $d_j$.
One practical concern in redundancy estimation is that $D_t$ can be very
large. To reduce the computational cost of redundancy decisions, $D_t$ can
be truncated to the most recent documents delivered for a profile.
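One way to keep only the most recent documents is a fixed-size buffer that
silently drops the oldest entry when a new one arrives. The sketch below
uses Python's `collections.deque` for this; the class name `RecentHistory`
and the parameter `max_docs` are hypothetical, and the text does not say
how large the window should be.

```python
from collections import deque
from typing import Deque, Iterator


class RecentHistory:
    """Holds at most `max_docs` of the most recently delivered documents."""

    def __init__(self, max_docs: int) -> None:
        # A deque with maxlen discards the oldest item once it is full.
        self._docs: Deque[str] = deque(maxlen=max_docs)

    def add(self, doc: str) -> None:
        self._docs.append(doc)

    def __iter__(self) -> Iterator[str]:
        return iter(self._docs)

    def __len__(self) -> int:
        return len(self._docs)


history = RecentHistory(max_docs=3)
for i in range(5):
    history.add(f"doc {i}")
print(list(history))  # only the three most recent documents remain
```

The cost of each redundancy decision is then bounded by the window size
rather than by the total number of documents ever delivered.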
One possibly subtle characteristic of the problem is that redundancy is not
symmetric. $d_j$ may cause $d_k$ to be viewed as redundant, but if the
presentation order is reversed, $d_k$ and $d_j$ may both be viewed as
containing novel information. A simple example is a document $d_k$ that is a
subset (e.g., a paragraph) of a longer document $d_j$. This characteristic
motivates exploration of asymmetric forms of traditional
similarity/distance/difference measures.
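The subset example can be made concrete with a deliberately asymmetric
score. The function below, with the hypothetical name `coverage`, measures
what fraction of the new document's distinct words already appeared in an
earlier document. It is a toy illustration of asymmetry, not one of the
measures evaluated in the cited work.

```python
def coverage(new_doc: str, old_doc: str) -> float:
    """Fraction of the new document's distinct words already seen in old_doc."""
    new_words = set(new_doc.lower().split())
    old_words = set(old_doc.lower().split())
    if not new_words:
        return 0.0  # assumption: an empty document is scored as not redundant
    return len(new_words & old_words) / len(new_words)


long_doc = "the merger was approved today and shares rose sharply"
short_doc = "the merger was approved today"

# The short document arrives after the long one: everything in it was seen.
print(coverage(short_doc, long_doc))
# The long document arrives after the short one: part of it is new.
print(coverage(long_doc, short_doc))
```

Swapping the arguments changes the score, which is exactly the behaviour a
symmetric measure cannot express.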
Several different approaches to redundancy detection have been proposed
and evaluated (73)(4). The simple set distance measure is designed for
Boolean, set-based document models. The geometric distance (cosine
similarity) measure is a simple metric designed for vector space document
models. Several variations of KL divergence and related smoothing algorithms
are more complex metrics designed to measure differences in probabilistic
document models.
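To give a feel for the vector space and probabilistic measures, here is a
minimal sketch of cosine similarity over raw term counts and of KL
divergence between two smoothed unigram models. Several choices are
assumptions made for illustration only: whitespace tokenization, additive
smoothing with a hypothetical parameter `alpha`, and the direction of the
divergence (new document against old). The cited evaluations may use
different weighting and smoothing schemes.

```python
import math
from collections import Counter


def term_counts(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine_similarity(new_doc: str, old_doc: str) -> float:
    """Cosine of the angle between the two term-count vectors (symmetric)."""
    a, b = term_counts(new_doc), term_counts(old_doc)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def kl_divergence(new_doc: str, old_doc: str, alpha: float = 0.1) -> float:
    """KL(new || old) between additively smoothed unigram models (asymmetric)."""
    a, b = term_counts(new_doc), term_counts(old_doc)
    vocab = a.keys() | b.keys()
    total_a = sum(a.values()) + alpha * len(vocab)
    total_b = sum(b.values()) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (a[w] + alpha) / total_a  # smoothing keeps every probability > 0
        q = (b[w] + alpha) / total_b
        kl += p * math.log(p / q)
    return kl


old = "the merger was approved today and shares rose sharply"
new = "the merger was approved today"
print(cosine_similarity(new, old))
print(kl_divergence(new, old), kl_divergence(old, new))
```

Note that cosine similarity grows with redundancy while KL divergence
shrinks, so a divergence has to be negated or otherwise inverted before it
can serve as $R(d_t \mid d_j)$.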