if λ = 0. (See (26) and (29) for computational complexity, parameter tuning
and implementation issues.)
In adaptive filtering, each query is considered as a class, and the class
model - a set of regression coefficients corresponding to individual terms -
is the query profile as viewed by the system. As for the training set, we use
the query itself as the initial positive training example of the class, and the
user-highlighted pieces of text (marked as Relevant or Not-relevant) during
feedback as additional training examples. To address the cold start issue
in the early stage before any user feedback is obtained, the system uses a
small sample from a retrospective corpus as the initial negative examples in
the training set. The details of using logistic regression for adaptive filtering
(assigning different weights to positive and negative training instances, and
regularizing the objective function to prevent overfitting on training data) are
presented in (26).
The class model w learned by Logistic Regression, or the query profile, is
a vector whose dimensions are individual terms and whose elements are the
regression coefficients, indicating how influential each term is in the query
profile. The query profile is updated whenever a new piece of user feedback
is received. A temporally decaying weight can be applied to each training
example, as an option, to emphasize the most recent user feedback.
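The profile update described above can be sketched as weighted, L2-regularized logistic regression fit by gradient descent. This is an illustrative implementation, not the chapter's exact procedure: the function name `train_profile`, the geometric decay factor, and the optimizer settings are all assumptions (the chapter only states that a temporally decaying weight can optionally be applied to training examples).

```python
import math

def train_profile(examples, dim, lam=0.1, lr=0.5, epochs=200, decay=0.9):
    """Fit a query profile w by weighted, L2-regularized logistic regression.

    examples: list of (x, y) pairs in arrival order (newest last); x is a
    dense list of term weights, y is +1 (Relevant) or -1 (Not-relevant).
    decay < 1 geometrically down-weights older user feedback (assumption).
    """
    n = len(examples)
    # Newest example gets weight 1; older feedback decays geometrically.
    inst_w = [decay ** (n - 1 - i) for i in range(n)]
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the (lam/2)||w||^2 term
        for (x, y), c in zip(examples, inst_w):
            s = sum(wj * xj for wj, xj in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-y * s))  # P(correct label | x, w)
            g = c * (p - 1.0) * y               # gradient of -c * log p wrt s
            for j in range(dim):
                grad[j] += g * x[j]
        for j in range(dim):
            w[j] -= lr * grad[j] / n
    return w
```

In a running system this would be re-invoked on the accumulated training set whenever a new piece of feedback arrives, yielding the updated profile vector w.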
9.3.2 Passage Retrieval Component
We use standard IR techniques in this part of our system. Incoming
documents are processed in chunks, where each chunk can be defined as a fixed
span of time or as a fixed number of documents, as preferred by the user. For
each incoming document, corpus statistics like the IDF (Inverse Document
Frequency) of each term are updated. We use a state-of-the-art named entity
identifier and tracker (9; 15) to identify person and location names, and merge
them with co-referent named entities seen in the past. Then the documents
are segmented into passages, which can be a whole document, a paragraph,
a sentence, or any other continuous span of text, as preferred. Each passage
is represented using a vector of TF-IDF (Term Frequency-Inverse Document
Frequency) weights, where term can be a word or a named entity.
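The incremental corpus statistics and passage representation might look like the following sketch. The class and function names (`CorpusStats`, `tfidf_vector`) and the add-one IDF smoothing are illustrative assumptions; the chapter does not give its exact weighting formula.

```python
import math
from collections import Counter

class CorpusStats:
    """Document frequencies, updated incrementally as documents arrive."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    def update(self, doc_terms):
        """Fold one incoming document's terms into the corpus statistics."""
        self.n_docs += 1
        self.df.update(set(doc_terms))  # count each term once per document

    def idf(self, term):
        # Add-one smoothing (assumption) so unseen terms get a finite weight.
        return math.log((self.n_docs + 1) / (self.df[term] + 1))

def tfidf_vector(passage_terms, stats):
    """Represent a passage as a sparse dict of TF-IDF weights.

    A "term" here may be a word or a merged named entity, as in the text.
    """
    tf = Counter(passage_terms)
    return {t: c * stats.idf(t) for t, c in tf.items()}
```

Because statistics are updated per incoming document, IDF values drift over time, which is the intended behavior for a stream of chunked documents.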
Given a query (represented using its profile as described in Section 9.3.1),
the system computes a relevance score (the posterior probability of belonging
to class '+1') for each passage x using the logistic regression solution w :
f_RL(x) = P(y = 1 | x, w) = 1 / (1 + e^{-w·x})                    (9.1)
Passages are ordered by these relevance scores and the ones with scores
above a relevance threshold (tuned on a training set) comprise the relevance
list that is passed on to the next step - novelty detection.
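The scoring and thresholding step above can be sketched directly from Eq. (9.1). The function names and the sparse-dict vector representation are assumptions for illustration; the threshold value itself would be tuned on a training set as the text describes.

```python
import math

def relevance_score(w, x):
    """Eq. (9.1): P(y = 1 | x, w) = 1 / (1 + exp(-w·x)), sparse vectors."""
    s = sum(wj * x.get(t, 0.0) for t, wj in w.items())
    return 1.0 / (1.0 + math.exp(-s))

def relevance_list(w, passages, threshold):
    """Rank passages by relevance score and keep those above the threshold.

    passages: list of (passage_id, tfidf_dict); returns (id, score) pairs,
    highest-scoring first, for hand-off to novelty detection.
    """
    scored = [(relevance_score(w, x), pid) for pid, x in passages]
    scored.sort(reverse=True)
    return [(pid, s) for s, pid in scored if s >= threshold]
```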