and negative classes. For example, using the simple bag-of-words (BOW) approach, sentences are
represented as unordered collections of their unigrams and the learning method determines which
unigrams tend to occur with each class.
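
As a rough illustration of the BOW idea, the following minimal sketch represents each sentence as an unordered collection of unigrams and counts which unigrams occur with each class. The toy training sentences, the function names, and the choice of add-one-smoothed unigram likelihoods (with no class priors) are assumptions made for this example, not a method prescribed by the text.

```python
import math
from collections import Counter

def bag_of_words(sentence):
    """Represent a sentence as an unordered multiset of lowercase unigrams."""
    return Counter(sentence.lower().split())

def train_unigram_model(labeled_sentences):
    """Count how often each unigram occurs with each class."""
    counts = {}
    for sentence, label in labeled_sentences:
        counts.setdefault(label, Counter()).update(bag_of_words(sentence))
    return counts

def classify(sentence, counts):
    """Pick the class with the highest add-one-smoothed unigram log-likelihood."""
    vocab = set().union(*counts.values())
    best_label, best_score = None, -math.inf
    for label, unigram_counts in counts.items():
        total = sum(unigram_counts.values())
        score = sum(
            n * math.log((unigram_counts[w] + 1) / (total + len(vocab)))
            for w, n in bag_of_words(sentence).items()
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data for illustration only.
TRAIN = [
    ("i love this idea", "positive"),
    ("what a great design", "positive"),
    ("i hate this idea", "negative"),
    ("what a terrible design", "negative"),
]
MODEL = train_unigram_model(TRAIN)
```

Because word order is discarded, `bag_of_words("great idea")` and `bag_of_words("idea great")` are identical; only the per-class unigram counts influence the decision.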
One can go beyond the BOW approach to learn much more complex patterns.
Riloff and Phillips [2004] presented a method for learning subjective extraction patterns from
a large amount of data. The method takes subjective and non-subjective text as input and outputs
significant lexico-syntactic patterns that can discriminate between subjective and non-subjective
sentences. These patterns are based on shallow syntactic structure output by the Sundance dependency
parser [Riloff and Phillips, 2004]. They are extracted by exhaustively applying syntactic templates
such as <subj> passive-verb and active-verb <dobj> to a training corpus, with an extracted
pattern for every instantiation of the syntactic template. The patterns are scored according to the
probability that a sentence is subjective given the pattern, and according to the frequency of the
pattern. Because these patterns are based on syntactic structure, they can represent subjective
expressions that are not fixed word sequences and would therefore be missed by a simple n-gram approach.
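
The scoring step can be sketched roughly as follows. The sketch assumes the patterns have already been extracted for each sentence (the parsing and template instantiation are not shown) and computes, for each pattern, its frequency and the estimated probability that a sentence containing it is subjective. The text does not say how the two quantities are combined; filtering on two thresholds, the threshold values, the function names, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter

def score_patterns(sentences):
    """Score patterns from (patterns_in_sentence, is_subjective) pairs.

    Returns {pattern: (frequency, P(subjective | pattern))}, where frequency
    is the number of sentences that contain the pattern.
    """
    freq = Counter()
    subj_freq = Counter()
    for patterns, is_subjective in sentences:
        for pattern in set(patterns):  # count each pattern once per sentence
            freq[pattern] += 1
            if is_subjective:
                subj_freq[pattern] += 1
    return {p: (freq[p], subj_freq[p] / freq[p]) for p in freq}

def select_patterns(scores, min_freq=2, min_prob=0.9):
    """Keep patterns that are both frequent and strongly tied to subjectivity.

    The thresholds are hypothetical defaults, not values from the text.
    """
    return sorted(
        p for p, (f, prob) in scores.items() if f >= min_freq and prob >= min_prob
    )

# Toy, made-up corpus: each sentence is reduced to the patterns it instantiates.
CORPUS = [
    ({"<subj> was pleased", "expressed <dobj>"}, True),
    ({"<subj> was pleased"}, True),
    ({"expressed <dobj>", "<subj> was held"}, False),
    ({"<subj> was held"}, False),
    ({"<subj> was pleased", "<subj> was held"}, True),
]
SCORES = score_patterns(CORPUS)
SELECTED = select_patterns(SCORES)
```

In this toy corpus only `<subj> was pleased` passes both thresholds, since it occurs in three sentences and all three are subjective.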
The disadvantage of a lexicon-based system is that it usually relies on a hand-built dictionary,
which requires many human hours and limits portability to new domains and modalities since
vocabularies may differ. An advantage of a statistical system, in contrast, is that porting it to a
new domain only requires the new dataset and any requisite annotations from which to learn. The
annotation itself may admittedly be time-consuming, depending on how coarse or fine it is, but once
complete, the system will automatically learn new subjective and polar terms for that domain. On
the other hand, the advantage of a lexicon-based system is that it can have very high precision, since
the dictionaries are typically hand-built and tuned for a particular domain.
As mentioned, the supervised vs. unsupervised distinction is roughly related to the lexicon-
based vs. statistical-based distinction in practice. Lexicon-based approaches are often unsupervised,
rule-based algorithms (e.g., SO-Cal), while statistical systems typically are trained on labeled data.
However, systems can easily cut across these distinctions, e.g., by using the output of a lexicon-
based system as a feature of a statistical classifier. And a lexicon-based system itself need not be
entirely hand-crafted, but can incorporate words and associated scores that are learned from data in
a supervised or semi-supervised fashion.
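
One way such a hybrid might look is sketched below: the score produced by a small rule-based lexicon component is added as one feature alongside unigram features, and a simple perceptron is trained on labeled sentences. The lexicon entries, the feature names, the perceptron learner, and the toy data are all hypothetical choices for this example and are not taken from SO-Cal or any other system mentioned here.

```python
# Hypothetical mini-lexicon: word -> polarity score (made up for illustration).
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}

def lexicon_score(tokens, lexicon=LEXICON):
    """Unsupervised, rule-based component: sum the polarity of known words."""
    return sum(lexicon.get(t, 0.0) for t in tokens)

def extract_features(sentence, lexicon=LEXICON):
    """Combine unigram indicator features with the lexicon component's output."""
    tokens = sentence.lower().split()
    features = {f"unigram={t}": 1.0 for t in tokens}
    features["lexicon_score"] = lexicon_score(tokens, lexicon)
    return features

def activation(features, weights):
    """Dot product of a sparse feature dict with the weight dict."""
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

def train_perceptron(examples, epochs=10):
    """Tiny perceptron over sparse feature dicts; labels are +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for sentence, label in examples:
            feats = extract_features(sentence)
            if label * activation(feats, weights) <= 0:  # misclassified
                for k, v in feats.items():
                    weights[k] = weights.get(k, 0.0) + label * v
    return weights

def predict(sentence, weights):
    """Return +1 (positive) or -1 (negative)."""
    return 1 if activation(extract_features(sentence), weights) > 0 else -1

# Toy, made-up labeled data for illustration only.
EXAMPLES = [
    ("a great plan", 1),
    ("an awful plan", -1),
    ("good work", 1),
    ("bad work", -1),
]
WEIGHTS = train_perceptron(EXAMPLES)
```

In this sketch the learned weight on `lexicon_score` lets the classifier handle a sentence such as "awful work", whose word combination never appeared in training, because the lexicon component still contributes a negative score.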
3.3.2 SENTIMENT DETECTION IN CONVERSATIONS
With meetings, most recent sentiment detection work has focused on the AMI corpus (see Chapter 2).
Somasundaran et al. [2007] describe their coding scheme for opinion annotation and apply it
to a subset of the AMI corpus. They consider two types of opinions: expressing sentiment, which in-
cludes feelings and emotions, and arguing, which includes convictions and persuasion. Their system
for detecting sentiment and arguing is a good example of combining lexicon-based and statistical
approaches, as they avail themselves of existing sentiment lexicons and create a new arguing lexicon,
but also combine these knowledge sources with dialogue act and adjacency pair information in a