Cross-Domain Opinion Word Identification with Query-By-Committee Active Learning - Technologies and Applications of Artificial Intelligence


List-based systems output a list of opinion words. Such systems are usually either propagation-based or co-occurrence-based. Propagation-based approaches have two main steps: sentiment seed collection and sentiment value propagation. In the first step, seeds with accurate sentiment values are collected. Usually, these seeds are manually annotated or collected from existing dictionaries. In the second step, an existing word/phrase/concept graph is used as the foundation. Sentiment values are propagated from seeds to the remaining parts of the foundation graph [3, 9]. Co-occurrence-based approaches employ co-occurrence statistics to estimate whether an opinion word candidate corresponds to a given opinion target and vice versa [10, 6]. Both list-based approaches can construct opinion word dictionaries without human annotation.
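The two propagation steps can be illustrated with a minimal sketch. It assumes an undirected word graph given as edge pairs and a simple damped neighbour-averaging update. The function name `propagate_sentiment`, the `damping` parameter, and the toy graph are hypothetical choices for illustration, not the propagation schemes used in [3, 9].

```python
from collections import defaultdict


def propagate_sentiment(edges, seeds, iterations=10, damping=0.5):
    """Spread seed sentiment scores over an undirected word graph.

    edges: iterable of (word_a, word_b) pairs (hypothetical similarity links).
    seeds: dict mapping word -> score in [-1, 1]; seed scores stay fixed.
    damping: assumed attenuation factor applied at every hop.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Step 1: start from the seed values; every other word starts at 0.
    scores = {word: 0.0 for word in neighbours}
    scores.update(seeds)

    # Step 2: repeatedly replace each non-seed score with the damped
    # average of its neighbours' scores (synchronous update).
    for _ in range(iterations):
        updated = {}
        for word in scores:
            if word in seeds:
                updated[word] = seeds[word]
                continue
            adjacent = neighbours.get(word, ())
            mean = sum(scores[n] for n in adjacent) / len(adjacent) if adjacent else 0.0
            updated[word] = damping * mean
        scores = updated
    return scores


edges = [("good", "great"), ("great", "superb"), ("bad", "awful")]
seeds = {"good": 1.0, "bad": -1.0}
scores = propagate_sentiment(edges, seeds)
```

In this toy graph, words connected to the positive seed end up with positive scores and the neighbour of the negative seed ends up negative, with magnitude shrinking as distance from a seed grows.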

List-based OWI, however, does not tell us much about the context in which opinion words are used; it simply outputs a list of all the opinion words in a body of text. To better understand opinion words in context, it is necessary to find the exact sentence positions where the words are mentioned. One common way of identifying the positions of opinion words in the output list is to match them back against the text. All matched occurrences in the text are then regarded as opinion mentions. The problem with this approach is that not all matched positions are actual opinion mentions. For example, the word “美味/delicious” would not necessarily represent an opinion in a review of a restaurant named “美味餐廳/Delicious Restaurant”.
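The matching problem can be reproduced in a few lines. The sketch below uses an English stand-in for the example and assumes case-insensitive whole-word matching with Python's `re` module. The function name `match_opinion_words` and the sample review are hypothetical. For Chinese text, where words are not delimited by spaces, the `\b` boundary would have to be replaced by a segmentation step.

```python
import re


def match_opinion_words(text, opinion_words):
    """Return (start, end, matched_text) for every lexicon word found in text."""
    matches = []
    for word in opinion_words:
        pattern = r"\b" + re.escape(word) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), m.group(0)))
    return sorted(matches)


review = "We ate at Delicious Restaurant. The soup was delicious."
hits = match_opinion_words(review, ["delicious"])
```

Both occurrences are returned, although only the second expresses an opinion. The first is part of the restaurant's name, which is exactly the kind of false positive described above.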

The mention-based approach is designed to identify and locate all opinion mentions in reviews. Mention-based OWI is usually formulated as a sequence labelling task in which each token is labelled as either “opinion-word mention” or “other” [11]. The approach can achieve high accuracy, but because it requires large amounts of annotated data, construction of a mention-based OWI system for a new domain can be costly in terms of human effort. One way to reduce this cost is to adapt an existing system for use in a new domain. However, cross-domain OWI poses its own problems, as the original domain data may not be compatible with the new domain. Finding the optimal way to selectively annotate sufficient data from the new domain is a critical challenge in cross-domain OWI.
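To make the sequence-labelling formulation concrete, the sketch below shows one possible data representation: a token list paired with a parallel label list, from which mention positions are read off. The tag names `OP` and `O`, the helper `mention_positions`, and the example sentence are assumptions for illustration. The labels are written by hand here, whereas in a real system they would come from a trained tagger.

```python
OPINION, OTHER = "OP", "O"  # hypothetical tag names for the two classes


def mention_positions(tokens, labels):
    """Return (index, token) for every token labelled as an opinion-word mention."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [(i, tok) for i, (tok, lab) in enumerate(zip(tokens, labels)) if lab == OPINION]


tokens = ["The", "soup", "at", "Delicious", "Restaurant", "was", "delicious"]
# Hand-assigned labels: only the final token expresses an opinion.
labels = [OTHER, OTHER, OTHER, OTHER, OTHER, OTHER, OPINION]
mentions = mention_positions(tokens, labels)
```

Because every token receives its own label, the occurrence inside the restaurant name can be marked “other” while the same word at the end of the sentence is marked as a mention.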

Active learning is a method employed in many NLP tasks to select which new data to annotate. For example, it has performed well in named-entity recognition [8] and sentiment classification [5]. The objective of active learning is to use the least amount of annotated data to achieve the highest performance. Query by Committee (QBC) [7] is one of the most efficient active learning algorithms. The QBC approach asks every model (committee member) to vote on the label of every query (data instance). Only the most uncertain instances (those labelled most diversely by the committee) are selected for manual annotation. In this study, we propose a new cross-domain opinion word extraction approach with QBC-based active learning. We adapt our system from one of three source domains to one of three target domains. Our system is tested on six source-target domain pairs in total. We review the related research in Section 2 and illustrate our approach in Section 3. In Section 4, we report our evaluation results. Our concluding remarks are given in Section 5.
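The QBC selection step can be sketched as follows. Disagreement is measured here with vote entropy, a common choice that is assumed for illustration; the measure actually used in this study is described in Section 3. The function names, the instance identifiers, and the votes are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes; 0 when all members agree."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_queries(committee_votes, k):
    """Pick the k instances on which the committee disagrees most.

    committee_votes: dict mapping instance id -> list of labels,
    one label per committee member.
    """
    ranked = sorted(
        committee_votes,
        key=lambda inst: vote_entropy(committee_votes[inst]),
        reverse=True,
    )
    return ranked[:k]


committee_votes = {
    "s1": ["OP", "OP", "OP"],  # unanimous: nothing to learn from annotating it
    "s2": ["OP", "O", "OP"],   # split vote: most uncertain
    "s3": ["O", "O", "O"],
}
to_annotate = select_queries(committee_votes, k=1)
```

Only the instance with the split vote is sent for manual annotation; instances on which the committee is unanimous are left unlabelled, which is how QBC keeps annotation cost low.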

