if λ = 0. (See (26) and (29) for computational complexity, parameter tuning
and implementation issues.)
In adaptive filtering, each query is considered as a class, and the class
model - a set of regression coefficients corresponding to individual terms -
is the query profile as viewed by the system. As for the training set, we use
the query itself as the initial positive training example of the class, and the
user-highlighted pieces of text (marked as Relevant or Not-relevant) during
feedback as additional training examples. To address the cold start issue
in the early stage before any user feedback is obtained, the system uses a
small sample from a retrospective corpus as the initial negative examples in
the training set. The details of using logistic regression for adaptive filtering
(assigning different weights to positive and negative training instances, and
regularizing the objective function to prevent overfitting on training data) are
presented in (26).
The class model w learned by Logistic Regression, or the query profile, is
a vector whose dimensions are individual terms and whose elements are the
regression coefficients, indicating how influential each term is in the query
profile. The query profile is updated whenever a new piece of user feedback
is received. A temporally decaying weight can be applied to each training
example, as an option, to emphasize the most recent user feedback.
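The profile update described above can be sketched as weighted, L2-regularized logistic regression fit by gradient descent. This is an illustrative implementation, not the chapter's exact procedure: the function name `train_profile`, the geometric decay factor, and the optimizer settings are all assumptions (the chapter only states that a temporally decaying weight can optionally be applied to training examples).

```python
import math

def train_profile(examples, dim, lam=0.1, lr=0.5, epochs=200, decay=0.9):
    """Fit a query profile w by weighted, L2-regularized logistic regression.

    examples: list of (x, y) pairs in arrival order (newest last); x is a
    dense list of term weights, y is +1 (Relevant) or -1 (Not-relevant).
    decay < 1 geometrically down-weights older user feedback (assumption).
    """
    n = len(examples)
    # Newest example gets weight 1; older feedback decays geometrically.
    inst_w = [decay ** (n - 1 - i) for i in range(n)]
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the (lam/2)||w||^2 term
        for (x, y), c in zip(examples, inst_w):
            s = sum(wj * xj for wj, xj in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-y * s))  # P(correct label | x, w)
            g = c * (p - 1.0) * y               # gradient of -c * log p wrt s
            for j in range(dim):
                grad[j] += g * x[j]
        for j in range(dim):
            w[j] -= lr * grad[j] / n
    return w
```

In a running system this would be re-invoked on the accumulated training set whenever a new piece of feedback arrives, yielding the updated profile vector w.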
9.3.2 Passage Retrieval Component
We use standard IR techniques in this part of our system. Incoming
documents are processed in chunks, where each chunk can be defined as a fixed
span of time or as a fixed number of documents, as preferred by the user. For
each incoming document, corpus statistics like the IDF (Inverse Document
Frequency) of each term are updated. We use a state-of-the-art named entity
identifier and tracker (9; 15) to identify person and location names, and merge
them with co-referent named entities seen in the past. Then the documents
are segmented into passages, which can be a whole document, a paragraph,
a sentence, or any other continuous span of text, as preferred. Each passage
is represented using a vector of TF-IDF (Term Frequency-Inverse Document
Frequency) weights, where term can be a word or a named entity.
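The incremental corpus statistics and passage representation might look like the following sketch. The class and function names (`CorpusStats`, `tfidf_vector`) and the add-one IDF smoothing are illustrative assumptions; the chapter does not give its exact weighting formula.

```python
import math
from collections import Counter

class CorpusStats:
    """Document frequencies, updated incrementally as documents arrive."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    def update(self, doc_terms):
        """Fold one incoming document's terms into the corpus statistics."""
        self.n_docs += 1
        self.df.update(set(doc_terms))  # count each term once per document

    def idf(self, term):
        # Add-one smoothing (assumption) so unseen terms get a finite weight.
        return math.log((self.n_docs + 1) / (self.df[term] + 1))

def tfidf_vector(passage_terms, stats):
    """Represent a passage as a sparse dict of TF-IDF weights.

    A "term" here may be a word or a merged named entity, as in the text.
    """
    tf = Counter(passage_terms)
    return {t: c * stats.idf(t) for t, c in tf.items()}
```

Because statistics are updated per incoming document, IDF values drift over time, which is the intended behavior for a stream of chunked documents.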
Given a query (represented using its profile as described in Section 9.3.1),
the system computes a relevance score (the posterior probability of belonging
to class '+1') for each passage x using the logistic regression solution w :
f_RL(x) = P(y = 1 | x, w) = 1 / (1 + e^{-w·x})                    (9.1)
Passages are ordered by these relevance scores and the ones with scores
above a relevance threshold (tuned on a training set) comprise the relevance
list that is passed on to the next step - novelty detection.
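The scoring and thresholding step above can be sketched directly from Eq. (9.1). The function names and the sparse-dict vector representation are assumptions for illustration; the threshold value itself would be tuned on a training set as the text describes.

```python
import math

def relevance_score(w, x):
    """Eq. (9.1): P(y = 1 | x, w) = 1 / (1 + exp(-w·x)), sparse vectors."""
    s = sum(wj * x.get(t, 0.0) for t, wj in w.items())
    return 1.0 / (1.0 + math.exp(-s))

def relevance_list(w, passages, threshold):
    """Rank passages by relevance score and keep those above the threshold.

    passages: list of (passage_id, tfidf_dict); returns (id, score) pairs,
    highest-scoring first, for hand-off to novelty detection.
    """
    scored = [(relevance_score(w, x), pid) for pid, x in passages]
    scored.sort(reverse=True)
    return [(pid, s) for s, pid in scored if s >= threshold]
```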