Adaptive Information Filtering - Text Mining: Classification, Clustering, and Applications

The problem with using MLE is that if a word never occurs in document $d$, it will get a zero probability ($P(w_k \mid d) = 0$). Thus a word in $d_t$ but not in $d_j$ will make $KL(\theta_{d_t}, \theta_{d_j}) = \infty$.

Smoothing techniques are necessary to adjust the maximum likelihood estimation so that the KL-based measure is more appropriate. Research shows that retrieval and filtering performance is highly sensitive to smoothing parameters when using language models. Several smoothing methods have been applied to ad hoc information retrieval, text classification problems, and novelty detection (69)(73).
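The zero-probability problem and the effect of smoothing can be illustrated with a minimal sketch. The text does not say which smoothing method is used, so the linear interpolation with a background model below is an assumption chosen for brevity; the two toy documents, the function names, and the value of `lam` are likewise hypothetical.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum likelihood unigram model: P(w | d) = count(w, d) / |d|."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}


def smoothed(tokens, background, lam=0.5):
    """Interpolate the document model with a background model.

    Assumed smoothing choice (not taken from the text):
    P(w | d) = (1 - lam) * P_mle(w | d) + lam * P(w | background).
    """
    doc = mle(tokens)
    return {w: (1 - lam) * doc.get(w, 0.0) + lam * p
            for w, p in background.items()}


def kl(p, q):
    """KL(p, q) = sum_w p(w) * log(p(w) / q(w)).

    Returns infinity if q assigns zero probability to a word that p uses.
    """
    total = 0.0
    for w, pw in p.items():
        if pw == 0.0:
            continue
        qw = q.get(w, 0.0)
        if qw == 0.0:
            return math.inf
        total += pw * math.log(pw / qw)
    return total


# Hypothetical documents: d_t contains words that d_j lacks.
d_t = "stocks fell sharply after the earnings report".split()
d_j = "stocks rose after the earnings report".split()

# Toy stand-in for a collection-wide background model.
background = mle(d_t + d_j)

kl_mle = kl(mle(d_t), mle(d_j))
kl_smooth = kl(smoothed(d_t, background), smoothed(d_j, background))

print("KL with MLE:      ", kl_mle)
print("KL with smoothing:", kl_smooth)
```

With unsmoothed MLE the words "fell" and "sharply" have zero probability under the model of d_j, so the divergence is infinite. After interpolation every word in the background vocabulary has nonzero probability and the divergence is finite. Changing `lam` changes the resulting score, which is consistent with the sensitivity to smoothing parameters noted above.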

8.5.4 Summary of Novelty Detection

The work described above focuses on redundancy measures and is somewhat user independent, in the sense that these measures only calculate a score indicating the degree of redundancy in a document given a history of delivered documents. They do not actually decide whether a document is redundant or novel.

A redundancy threshold is needed in order to classify a document as redundant or novel. When human assessors were asked to make redundancy decisions given the same topics and document streams, they sometimes disagreed. In some cases the disagreement was based on differences in the assessors' internal definition of redundancy. More often, however, the assessors differed in how much overlap they would tolerate: one assessor might feel that a document $d_t$ should be considered redundant if a previously seen document $d_j$ covered 80% of $d_t$, while the other might not consider it redundant unless the coverage was more than 95%. A person's tolerance for redundancy can be modeled with a user-dependent threshold that converts a redundancy score into a redundancy decision. User feedback about which documents are redundant can serve as training data. Over time the system can learn to estimate the probability that a new document with a given redundancy score would be considered redundant. This probability can be expressed as $P(\text{user}_j \text{ thinks } d_t \text{ is redundant} \mid R(d_t \mid D_t))$.
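One way to picture learning this probability from feedback is a one-dimensional logistic model over the redundancy score. This is only a sketch under stated assumptions: the text does not name a learning method, so the logistic form, the gradient-descent fit, the feedback data, and the 0.5 cutoff are all hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(scores, labels, lr=0.5, epochs=5000):
    """Fit P(redundant | score) = sigmoid(a * score + b) by batch gradient descent."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of the log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Hypothetical feedback from one user: a redundancy score in [0, 1] for each
# delivered document, and 1 if the user judged that document redundant.
scores = [0.10, 0.30, 0.50, 0.70, 0.80, 0.85, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1]

a, b = fit_logistic(scores, labels)


def p_redundant(score):
    """Estimated probability that this user considers the document redundant."""
    return sigmoid(a * score + b)


def is_redundant(score, cutoff=0.5):
    """Turn the probability into a decision; the cutoff is an assumed value."""
    return p_redundant(score) >= cutoff


for s in (0.2, 0.6, 0.9):
    print(f"score={s:.2f}  P(redundant)={p_redundant(s):.3f}")
```

Fitting a separate pair `(a, b)` per user is what makes the resulting threshold user dependent: a tolerant user's feedback pushes the point where the probability crosses the cutoff toward higher redundancy scores.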

8.6 Other Adaptive Filtering Topics

While learning user profiles is an advantage of a filtering system, it is also a major research challenge in the adaptive filtering research community. Common learning algorithms require a significant amount of training data. However, a real-world filtering system must work as soon as the user uses the sys-
