Utility-Based Information Distillation - Text Mining: Classification, Clustering, and Applications


gain obtained by a user by going through a ranked list, from the top, up to a

given position. It allows for graded relevance, and discounts the gain received

at lower ranks to favor systems that place highly relevant documents near the

top of the ranked list. The DCG score at rank n is calculated as follows:

$$\mathrm{DCG}(n) = \sum_{i=1}^{n} \frac{G(d_i, q)}{\log_b(i + b - 1)} \tag{9.4}$$

where $d_i$ is the $i$-th document in the ranked list, $G(d_i, q)$ is the graded relevance

of document $d_i$ with respect to the query $q$, and parameter $b$ is a pre-specified

constant to control the discount rates with respect to the position of each

document in the ranked list. The DCG score is normalized with respect to

the ideal (best possible) DCG to get the Normalized Discounted Cumulated

Gain (NDCG). To obtain a single score for the system's performance on a

query, the NDCG scores at all ranks are averaged. Given a test set of queries,

the per-query NDCG scores are further averaged to produce a global score.
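
The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the default b = 2 is an assumption, and the ideal ranking is assumed to be the same list of gains sorted in descending order (in practice the ideal list would be built from all judged documents for the query).

```python
import math


def dcg_at_ranks(gains, b=2.0):
    """Return [DCG(1), ..., DCG(n)] following Eq. (9.4).

    gains[i-1] plays the role of G(d_i, q); b is the discount base.
    """
    scores, total = [], 0.0
    for i, gain in enumerate(gains, start=1):
        # The discount log_b(i + b - 1) equals 1 at rank 1 and grows with rank.
        total += gain / math.log(i + b - 1, b)
        scores.append(total)
    return scores


def ndcg_at_ranks(gains, b=2.0):
    """Normalize DCG at each rank by the DCG of the ideal ordering.

    Assumption: the ideal ordering is `gains` sorted in descending order.
    """
    actual = dcg_at_ranks(gains, b)
    ideal = dcg_at_ranks(sorted(gains, reverse=True), b)
    return [a / best if best > 0 else 0.0 for a, best in zip(actual, ideal)]


def mean_ndcg(gains, b=2.0):
    """Average the NDCG scores over all ranks to get one score per query."""
    scores = ndcg_at_ranks(gains, b)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical graded relevance judgments for one ranked list.
    print(mean_ndcg([3, 2, 3, 0, 1]))
```

A global score over a test set would then be the plain average of `mean_ndcg` over the per-query gain lists.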

In our evaluation scheme, we make two changes to the standard NDCG

metric, which we will describe in detail:

1. Replace graded document relevance $G(d_i, q)$ with graded passage utility

$U(p_i, q)$ that takes both nugget-based relevance and novelty into

account.

2. Penalize longer ranked lists to account for the effort spent by the user

in going through the system output.
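
As a rough sketch of the first change, the gain at each rank can be computed from the nuggets a passage contains rather than from a document-level relevance grade. The snippet assumes the definition given later in this section as Eq. (9.6), where a passage's utility is the sum of the weights of its nuggets. The names `passage_utility` and `dcu`, the dictionary representation of weights, and the default b = 2 are illustrative assumptions; novelty handling and the length penalty of the second change are not specified in this excerpt and are not modeled here.

```python
import math


def passage_utility(nuggets_in_passage, weights):
    """Sum the weights w_j of the nuggets j found in one passage."""
    return sum(weights[j] for j in nuggets_in_passage)


def dcu(ranked_passages, weights, b=2.0):
    """Discounted cumulated utility over a ranked list of passages.

    ranked_passages: list of sets of nugget ids, one set per passage,
    in rank order (a hypothetical representation of C(p_i)).
    """
    return sum(
        passage_utility(nuggets, weights) / math.log(i + b - 1, b)
        for i, nuggets in enumerate(ranked_passages, start=1)
    )


if __name__ == "__main__":
    # Hypothetical nuggets with equal initial weights.
    weights = {"n1": 1.0, "n2": 1.0, "n3": 1.0}
    ranked = [{"n1", "n2"}, {"n3"}, set()]
    print(dcu(ranked, weights))
```

With these made-up inputs the first passage contributes 2, the second contributes 1 discounted by log base 2 of 3, and the empty third passage contributes nothing, for a total of roughly 2.63.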

9.4.2.1 Graded passage utility

To account for the presence of nuggets as well as whether the nuggets have

been seen by the user in the past, we calculate the gain received from each

passage in terms of utility $U(p_i, q)$, instead of relevance $G(d_i, q)$. Thus, we

define Discounted Cumulated Utility (DCU) as:

$$\mathrm{DCU}(n) = \sum_{i=1}^{n} \frac{U(p_i, q)}{\log_b(i + b - 1)} \tag{9.5}$$

which is normalized with respect to the ideal DCU to get the Normalized

Discounted Cumulated Utility (NDCU). $U(p_i, q)$ is calculated as:

$$U(p_i, q) = \sum_{j \in C(p_i)} w_j \tag{9.6}$$

where $C(p_i)$ is the set of nuggets contained in passage $p_i$, determined using

the rules described in Section 9.4.1.2. Each nugget $N_j$ has an associated weight

$w_j$, which determines the utility derived by seeing that nugget in a

system-produced passage. These weights are initially set to be equal, but could also
