Utility-Based Information Distillation - Text Mining: Classification, Clustering, and Applications


be initialized based on the pyramid approach (14) to assign different levels of

importance to nuggets.

Since repeated occurrences of the same piece of information are less
useful (or not useful at all) to the user, we dampen the weight w_j of
each nugget N_j whenever it occurs in a system-produced passage, so that
subsequent occurrences receive lower utility. That is, for each nugget N_j, its
weight is updated as w_j = w_j · β, where β is a preset dampening factor.

When β = 1, no utility dampening occurs and each occurrence of the same

nugget is given an equal score, as with traditional relevance-based methods.

At the other extreme, β = 0 causes only the first occurrence of a nugget

to be scored, ignoring all its subsequent occurrences. As a middle ground,

a small non-zero dampening factor can be used if the user prefers to see

some redundancy, perhaps as an indicator of importance or reliability of the

presented information.

These nugget weights are preserved between evaluation of successive ranked

lists produced by the system, since the users are expected to remember what

the system showed them in the past. Hence, systems that show novel items

(i.e., items not seen in the past) and also produce non-redundant ranked lists

(i.e., do not display very similar passages at multiple positions in the same

ranked list) are favored by such an evaluation scheme.
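
The dampening update, and the way it carries over between ranked lists, can be illustrated with a minimal sketch. It assumes that nugget matching has already been done, so each passage is represented simply as a set of nugget identifiers; the function name `score_ranked_list` and the example weights are hypothetical, and rank discounting and normalization are deliberately left out.

```python
def score_ranked_list(ranked_list, weights, beta):
    """Return the gain credited to each passage in one ranked list.

    ranked_list: list of sets of nugget ids, in rank order (assumed input format).
    weights:     dict mapping nugget id -> current weight w_j; updated in place
                 so that dampening persists across successive ranked lists.
    beta:        preset dampening factor.
    """
    gains = []
    for nuggets_in_passage in ranked_list:
        gain = 0.0
        for j in nuggets_in_passage:
            gain += weights[j]      # credit the nugget at its current weight
            weights[j] *= beta      # w_j = w_j * beta for later occurrences
        gains.append(gain)
    return gains


# Illustrative weights only; the same dict is reused for both lists.
weights = {"n1": 1.0, "n2": 0.5}
first_list = score_ranked_list([{"n1"}, {"n1", "n2"}], weights, beta=0.5)
second_list = score_ranked_list([{"n2"}], weights, beta=0.5)
print(first_list, second_list)
```

Because `weights` is shared between the two calls, a nugget already shown in the first list earns less in the second, which is what rewards novelty across lists as well as non-redundancy within one list.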

9.4.2.2 Ranked list length penalty

Each passage selected by the system for the user's attention has an

associated cost in terms of user time and effort to review it. Therefore, an

adaptive filtering system must learn to limit the length of its ranked list to

balance this cost against the gain, as measured by NDCU. However, NDCU as

such is a recall-oriented measure, giving a system no incentive to limit the

ranked list length, since each additional passage in the list can only increase

the utility score. Hence, we assign a penalty to longer ranked lists, and

calculate Penalized Normalized Discounted Cumulated Utility (PNDCU) as follows:

$$\mathrm{PNDCU} = \lambda \cdot \mathrm{NDCU} + (1 - \lambda) \cdot \bigl(1 - \log_m (l + 1)\bigr) \qquad (9.7)$$

where l is the length of the system-produced ranked list, and m is the

maximum ranked list length allowed. λ controls the trade-off between the

gain and cost of going through the system's output.
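
As a rough illustration of Equation 9.7, the sketch below computes PNDCU from an already-computed NDCU value. The function name `pndcu` and the input checks are assumptions made for this example: in particular, it assumes λ lies in [0, 1] and that m is at least 2 so that the base-m logarithm is well defined.

```python
import math


def pndcu(ndcu, list_length, max_length, lam):
    """Penalized NDCU as in Equation 9.7.

    ndcu:        NDCU score of the ranked list (computed elsewhere).
    list_length: l, the length of the system-produced ranked list.
    max_length:  m, the maximum ranked list length allowed.
    lam:         lambda, the gain/cost trade-off (assumed to lie in [0, 1]).
    """
    if max_length < 2:
        raise ValueError("max_length must be at least 2 for log base m")
    if not 0 <= list_length <= max_length:
        raise ValueError("list_length must be between 0 and max_length")
    # Length term: equals 1 for an empty list and shrinks as the list grows.
    length_term = 1.0 - math.log(list_length + 1, max_length)
    return lam * ndcu + (1.0 - lam) * length_term


# With the same NDCU, the shorter list scores higher.
short_list = pndcu(ndcu=0.8, list_length=3, max_length=10, lam=0.5)
long_list = pndcu(ndcu=0.8, list_length=9, max_length=10, lam=0.5)
print(short_list > long_list)
```

Setting `lam` to 1 recovers plain NDCU, while smaller values put more weight on keeping the list short.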

9.5 Data

TDT4 was the evaluation benchmark corpus in TDT2002 and TDT2003.

The corpus consists of over 90,000 news articles from multiple sources (AP,

NYT, CNN, ABC, NBC, MSNBC, Xinhua, Zaobao, Voice of America, PRI
