Adaptive Information Filtering - Text Mining: Classification, Clustering, and Applications

When acquiring redundancy judgements and developing algorithms, we assume that the redundancy of a new document $d_t$ depends on the documents the user saw before $d_t$ arrived. We also assume that these documents are exactly the set $D_t$ of all documents delivered to the user profile by the time $d_t$ arrives. We use $R(d_t) = R(d_t \mid D_t)$ to measure the redundancy of $d_t$.

One approach to novelty/redundancy detection is to cluster all previously delivered documents $D_t$ and then measure the redundancy of the current document $d_t$ by its distance to each cluster. This approach would be similar to solutions for the TDT First Story Detection problem (2). However, it is sensitive to clustering accuracy and rests on strong assumptions about the nature of redundancy.
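
The cluster-based idea can be sketched as follows. This is only an illustration under stated assumptions: the clusters are taken as already given (no clustering algorithm is specified here), documents are represented by raw term counts, and Euclidean distance to the cluster centroid is one possible choice of distance. The names `centroid`, `distance`, and `nearest_cluster_distance` are hypothetical.

```python
import math
from collections import Counter

def centroid(cluster):
    """Mean term-count vector of the documents in one (non-empty) cluster."""
    total = Counter()
    for doc in cluster:
        total.update(doc.lower().split())
    return {word: count / len(cluster) for word, count in total.items()}

def distance(doc, center):
    """Euclidean distance between a document's term counts and a centroid."""
    counts = Counter(doc.lower().split())
    words = counts.keys() | center.keys()
    return math.sqrt(sum((counts[w] - center.get(w, 0.0)) ** 2 for w in words))

def nearest_cluster_distance(new_doc, clusters):
    """Smallest distance to any centroid; a small value suggests redundancy."""
    return min(distance(new_doc, centroid(c)) for c in clusters)

# Assumption: these clusters were produced earlier by some clustering step.
clusters = [
    ["stocks fell on monday", "stocks fell again on tuesday"],
    ["the team won the final"],
]
print(nearest_cluster_distance("the team won the final", clusters))
print(nearest_cluster_distance("new vaccine trial begins", clusters))
```

The sketch makes the sensitivity visible: a document that is a near copy of one delivered document can still sit far from the centroid of a poorly formed cluster.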

Another approach is to measure redundancy based on the distance between the new document and each previously delivered document (document-document distance). Some researchers developed this approach, arguing that it may be more robust than clustering and may better match how users view redundancy: they found that it is easiest for a user to identify a new document as redundant with a single previously seen document, and harder to identify it as redundant with a set of previously seen documents. The calculation of $R(d_t \mid D_t)$ is therefore simplified by setting it equal to the maximum of the pairwise values $R(d_t \mid d_j)$ over all previously delivered documents $d_j$:

$$R(d_t \mid D_t) = \max_{d_j \in D_t} R(d_t \mid d_j)$$
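
The maximum over pairwise scores can be sketched in a few lines. The pairwise measure used here, Jaccard overlap of word sets, is only a placeholder assumption so that the example runs; the function names `jaccard` and `redundancy` are hypothetical, and any pairwise measure could be passed in instead.

```python
def jaccard(new_doc: str, old_doc: str) -> float:
    """Placeholder pairwise score R(d_t | d_j): shared words / all words."""
    a, b = set(new_doc.lower().split()), set(old_doc.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def redundancy(new_doc, delivered, pairwise=jaccard):
    """R(d_t | D_t): the largest pairwise score against any delivered document.

    Returns 0.0 when nothing has been delivered yet (an assumption of this sketch).
    """
    return max((pairwise(new_doc, old) for old in delivered), default=0.0)

delivered = ["stocks fell sharply on monday", "the team won the final match"]
print(redundancy("stocks fell sharply on monday", delivered))  # 1.0, exact duplicate
print(redundancy("new vaccine trial begins", delivered))       # 0.0, no shared words
```

Taking the maximum means a single near-duplicate in the delivered set is enough to mark the new document as redundant, regardless of how dissimilar the remaining documents are.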

In the extreme case when $d_t$ and $d_j$ are exact duplicates ($d_t = d_j$), $R(d_t \mid d_j)$ should clearly have a high value, since a duplicate document is maximally redundant. One natural way to measure $R(d_t \mid d_j)$ is to use measures of similarity/distance/difference between $d_t$ and $d_j$.

One practical concern in redundancy estimation is that $D_t$ can be very large. To reduce the computational cost of redundancy decisions, $D_t$ can be truncated to the most recent documents delivered for a profile.
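
One simple way to implement this truncation is a fixed-size buffer per profile. The sketch below assumes a bounded `collections.deque`; the class name `ProfileHistory`, the parameter `max_docs`, and its default value are hypothetical, since the text does not specify a window size.

```python
from collections import deque

class ProfileHistory:
    """Keeps only the most recent `max_docs` documents delivered to one profile."""

    def __init__(self, max_docs: int = 1000):  # window size is an assumption
        self.recent = deque(maxlen=max_docs)

    def redundancy(self, new_doc, pairwise):
        """Maximum pairwise score against the retained documents only."""
        return max((pairwise(new_doc, old) for old in self.recent), default=0.0)

    def deliver(self, doc):
        """Record a delivered document; the oldest one is dropped when full."""
        self.recent.append(doc)

def exact_match(new_doc, old_doc):
    """Toy pairwise measure used only for this demonstration."""
    return 1.0 if new_doc == old_doc else 0.0

history = ProfileHistory(max_docs=2)
for doc in ["a", "b", "c"]:
    history.deliver(doc)

print(list(history.recent))                  # ['b', 'c']
print(history.redundancy("a", exact_match))  # 0.0, "a" has left the window
```

The last line shows the cost of truncation: a document that repeats something older than the window is no longer detected as redundant.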

One possibly subtle characteristic of the problem is that redundancy is not a symmetric measure. Document $d_j$ may cause $d_k$ to be viewed as redundant, but if the presentation order is reversed, $d_k$ and $d_j$ may both be viewed as containing novel information. A simple example is a document $d_k$ that is a subset (e.g., a paragraph) of a longer document $d_j$. This characteristic motivates exploration of asymmetric forms of traditional similarity/distance/difference measures.
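
The subset example can be made concrete with a deliberately asymmetric score. The measure below, the fraction of the new document's distinct words already present in the earlier document, is an illustrative assumption and not a measure taken from the text; the name `coverage` is hypothetical.

```python
def coverage(new_doc: str, old_doc: str) -> float:
    """Fraction of the new document's distinct words already seen in old_doc."""
    new_words = set(new_doc.lower().split())
    old_words = set(old_doc.lower().split())
    return len(new_words & old_words) / len(new_words) if new_words else 0.0

long_doc = ("the bridge closed on friday after inspectors "
            "found cracks in two support beams")
paragraph = "the bridge closed on friday"

# Long document seen first: the paragraph adds nothing new.
print(coverage(paragraph, long_doc))  # 1.0
# Paragraph seen first: most of the long document is still new.
print(coverage(long_doc, paragraph))  # 5/13, about 0.38
```

A symmetric measure would assign the same score in both presentation orders and so could not distinguish these two cases.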

Several different approaches to redundancy detection have been proposed and evaluated (73)(4). The simple set distance measure is designed for Boolean, set-based document models. The geometric distance (cosine similarity) measure is a simple metric designed for vector space document models. Several variations of KL divergence and related smoothing algorithms are more complex metrics designed to measure differences in probabilistic document models.
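
To make the three families concrete, the sketch below gives one plausible reading of each. These are assumptions for illustration: the exact formulations in the cited work may differ, whitespace tokenization stands in for real preprocessing, and additive smoothing with the hypothetical parameter `alpha` stands in for the smoothing algorithms mentioned above. All scores are oriented so that a larger value means more redundant.

```python
import math
from collections import Counter

def tokens(doc):
    return doc.lower().split()

def set_difference_redundancy(new_doc, old_doc):
    """Boolean model: negative count of distinct words that old_doc lacks."""
    return -len(set(tokens(new_doc)) - set(tokens(old_doc)))

def cosine_redundancy(new_doc, old_doc):
    """Vector space model: cosine of the angle between term-count vectors."""
    a, b = Counter(tokens(new_doc)), Counter(tokens(old_doc))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def kl_redundancy(new_doc, old_doc, alpha=0.1):
    """Probabilistic model: negative KL(new || old) over smoothed unigram models."""
    a, b = Counter(tokens(new_doc)), Counter(tokens(old_doc))
    vocab = a.keys() | b.keys()

    def prob(counts, word):
        # Additive smoothing keeps every probability strictly positive.
        return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocab))

    return -sum(prob(a, w) * math.log(prob(a, w) / prob(b, w)) for w in vocab)

old = "stocks fell sharply on monday"
new = "stocks fell sharply on monday as oil prices rose"
for measure in (set_difference_redundancy, cosine_redundancy, kl_redundancy):
    print(measure.__name__, measure(new, old), measure(old, new))
```

Printing each score in both directions shows that the set difference and KL divergence readings are asymmetric, while cosine similarity gives the same value either way.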
