Utility-Based Information Distillation - Text Mining: Classification, Clustering, and Applications

of supervised learning algorithms (e.g., Rocchio-style classifiers, Exponential-Gaussian models, local regression and logistic regression approaches) have been studied in adaptive settings with explicit and implicit relevance feedback, and on benchmark datasets from TREC (Text Retrieval Conferences) and the TDT (Topic Detection and Tracking) evaluation forum (1; 5; 8; 18; 25; 31; 29). Regularized logistic regression (26), for example, is one of the strong-performing methods in terms of both effectiveness and efficiency, and is easy to scale for frequent adaptations over large datasets such as the TREC-10 corpus with over 800,000 documents and 84 topics.
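As a rough illustration of why regularized logistic regression adapts cheaply, the sketch below updates a sparse model one feedback judgment at a time. It is a minimal sketch, not the method evaluated in (26): the class name `OnlineLogisticFilter`, the learning rate, the L2 strength, and the choice of plain stochastic gradient descent are all assumptions made for this example, and the penalty is applied only to terms present in the current document as a simplification.

```python
import math


def sigmoid(z):
    # Logistic function written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogisticFilter:
    """Hypothetical adaptive filter: L2-regularized logistic regression
    trained by one gradient step per relevance judgment."""

    def __init__(self, l2=0.01, lr=0.1):
        self.w = {}      # sparse term weights
        self.b = 0.0     # bias
        self.l2 = l2     # assumed regularization strength
        self.lr = lr     # assumed learning rate

    def score(self, doc):
        # doc is a dict mapping term -> weight (e.g., TF-IDF).
        z = self.b + sum(self.w.get(t, 0.0) * v for t, v in doc.items())
        return sigmoid(z)

    def update(self, doc, relevant):
        # Gradient of the log loss plus the L2 penalty, restricted to the
        # terms that occur in this document (a simplification).
        err = self.score(doc) - (1.0 if relevant else 0.0)
        for t, v in doc.items():
            w = self.w.get(t, 0.0)
            self.w[t] = w - self.lr * (err * v + self.l2 * w)
        self.b -= self.lr * err
```

Each update touches only the terms of one document, so the cost per adaptation is proportional to document length rather than to the size of the corpus.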

9.1.2 Related Work in Topic Detection and Tracking (TDT)

Topic Detection and Tracking (TDT) research focuses on the automated detection and tracking of news events from multiple sources of temporally ordered stories (2). TDT has two primary tasks: topic tracking and novelty detection. The topic tracking task, although defined independently, is almost identical to the adaptive filtering task, except that user feedback is assumed to be unavailable; pseudo-relevance feedback (PRF) by the system is allowed, however. PRF means that the system treats the top-ranking documents in a retrieved list for a topic as truly relevant when adapting its profile for that topic. PRF may be useful when training examples are sparse and when true relevance feedback is not sufficient (26).
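The PRF idea can be sketched in a few lines: rank the documents against the current topic profile, treat the top few as if a user had judged them relevant, and fold them into the profile. This is a minimal sketch under stated assumptions. The function names `cosine` and `prf_update`, the parameters `top_k` and `beta`, and the Rocchio-style additive update are choices made for this example, not details given in the text.

```python
import math


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def prf_update(profile, docs, top_k=2, beta=0.5):
    """Return a new profile adapted with pseudo-relevance feedback.

    The top_k documents most similar to the profile are assumed to be
    relevant, and their centroid (scaled by beta) is added to the profile.
    """
    ranked = sorted(docs, key=lambda d: cosine(profile, d), reverse=True)
    pseudo_relevant = ranked[:top_k]
    updated = dict(profile)
    for doc in pseudo_relevant:
        for term, weight in doc.items():
            updated[term] = updated.get(term, 0.0) + beta * weight / len(pseudo_relevant)
    return updated
```

Because no judgment is verified, any non-relevant document that ranks in the top `top_k` is absorbed into the profile as well, which is why the choice of `top_k` matters in a sketch like this.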

Novelty detection (ND), the other primary task in TDT, aims to detect the first report of each new event from temporally ordered news stories. The task is also called First-Story Detection (FSD) or New Event Detection (NED). There is a significant body of work addressing ND problems. Yang et al. (23) examined incremental clustering for grouping documents into events, and used cosine similarity in combination with a time-decaying function to measure the novelty of new documents with respect to historical events. Zhang et al. (30) developed a Bayesian statistical framework for modeling the growing number of events over time in a non-parametric Dirichlet process. Yang et al. (24) studied the effective use of Named Entities in modeling the novelty of documents conditioned on events and higher-level topics. Zhang et al. (32) compared alternative measures for sentence-level novelty detection conditioned on perfect knowledge of document-level relevance; cosine similarity worked best in their experiments. Allan et al. (3) argued for the importance of comparing novelty measures under a more realistic assumption, i.e., under the condition that sentence-level relevance is not available but is predicted by a system. Kuo et al. (12) developed an indexing-tree strategy for speedy computation and investigated the use of Named Entities.
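To make the combination of cosine similarity and time decay concrete, the sketch below scores each incoming story against all earlier stories and flags it as a first story when nothing recent is similar enough. It is a simplified illustration, not the incremental clustering of (23): it compares documents pairwise rather than against event clusters, and the exponential half-life decay, the `threshold` value, and the function names are assumptions made for this example.

```python
import math


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def novelty(doc, t, history, half_life=7.0):
    """One minus the largest time-decayed similarity to any earlier story.

    history holds (vector, timestamp) pairs; the decay halves the
    similarity for every half_life time units between the two stories.
    """
    best = 0.0
    for past_doc, past_t in history:
        decay = 0.5 ** ((t - past_t) / half_life)
        best = max(best, decay * cosine(doc, past_doc))
    return 1.0 - best


def first_stories(stream, threshold=0.6, half_life=7.0):
    """Return the indices of stories whose novelty reaches the threshold.

    stream is a list of (vector, timestamp) pairs in temporal order.
    """
    history, flagged = [], []
    for i, (doc, t) in enumerate(stream):
        if novelty(doc, t, history, half_life) >= threshold:
            flagged.append(i)
        history.append((doc, t))
    return flagged
```

The decay term lets an old event stop suppressing new reports that merely share vocabulary with it, at the cost of one more parameter (`half_life`) to tune.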
