regarding evaluation methodology: how can we measure the utility of such
an information distillation system? Existing metrics in standard IR, AF and
ND are insufficient, and new solutions must be explored, as we will discuss in
Section 9.4. First, we describe the technical cores of our system.
9.3 Technical Cores
Our system consists of the AF component for incremental learning of query
profiles, the passage retrieval component for estimating the relevance of each
passage with respect to a query profile, the novelty detection component for
assessing the novelty of each passage with respect to the user history, and the
anti-redundancy component for minimizing redundancy among the ranked
passages.
9.3.1 Adaptive Filtering Component
We use a state-of-the-art algorithm in the field: the regularized logistic
regression method, which had the best results on several benchmark evaluation
corpora for AF (26). Logistic regression (LR) is a supervised learning
algorithm for statistical classification. Based on a training set of labeled
instances, it learns a class model which can then be used to predict the labels
of unseen instances. Its performance, as well as its efficiency in terms of training
time, makes it a good candidate when frequent updates to the class model are
required, as is the case in adaptive filtering, where the system must learn
from each new piece of feedback provided by the user. Regularized logistic
regression has the following optimization criterion:
$$
w_{\mathrm{map}} = \arg\min_{w} \sum_{i=1}^{n} s(i)\,\log\!\left(1 + e^{-y_i\, w \cdot x_i}\right) + \lambda \lVert w \rVert^{2}
$$
The first term in the objective function is for reducing training-set errors,
where s(i) takes three different values (pre-specified constants) for query,
positive and negative documents, respectively. This is similar to Rocchio, where
different weights are given to the three kinds of training examples: topic
descriptions (queries), on-topic documents and off-topic documents. The
second term in the objective function is for regularization, equivalent to
adding a Gaussian prior to the regression coefficients with a zero mean and
covariance matrix (1/2λ)I, where I is the identity matrix. Tuning λ (> 0)
is theoretically justified for reducing model complexity (the effective
degrees of freedom) and avoiding over-fitting on the training data. The solution
of the modified objective function is called the Maximum A Posteriori (MAP)
estimate, which reduces to the maximum likelihood solution for standard LR
when λ = 0.
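The weighted, L2-regularized objective above can be sketched in a few lines of code. The following is a minimal illustration, not the chapter's implementation: the function names, the plain gradient-descent optimizer, the toy data, and the uniform instance weights s(i) are all assumptions made for the example.

```python
import numpy as np

def map_objective(w, X, y, s, lam):
    """Weighted logistic loss plus L2 regularization (lambda * ||w||^2)."""
    margins = y * (X @ w)                       # y_i * (w . x_i)
    return np.sum(s * np.log1p(np.exp(-margins))) + lam * np.dot(w, w)

def map_gradient(w, X, y, s, lam):
    """Gradient of the objective with respect to w."""
    margins = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))       # d/dm of log(1 + e^{-m})
    return -(X.T @ (s * y * sigma)) + 2.0 * lam * w

def fit(X, y, s, lam=0.1, lr=0.1, steps=500):
    """Find the MAP estimate w_map by plain gradient descent (illustrative)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * map_gradient(w, X, y, s, lam)
    return w

# Toy example: two on-topic (+1) and two off-topic (-1) documents.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([1.0, 1.0, -1.0, -1.0])
s = np.ones(4)   # in the chapter's setting, s(i) would differ per instance type
w_map = fit(X, y, s)
```

In practice s(i) would be set to the three pre-specified constants for query, positive, and negative instances, and the model would be re-fit (or incrementally updated) after each user feedback event.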