machine learning techniques, where a classifier can be trained on data labeled as subjective vs. non-
subjective or positive vs. negative, at the document or sentence level. Alternatively, semi-supervised
or unsupervised algorithms can predict subjectivity and polarity with little or no labeled data; an
example of a semi-supervised approach is to use a manually selected set of seed words that are known
to be subjective or to have a particular polarity, and use those seed words to automatically label
sentences or documents. This can lead to the discovery of new subjective words, expansion of the
seed set, and repetition of the whole process. This would be an example of a bootstrapping procedure.
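One round of such a bootstrapping loop might be sketched as follows. The seed words, the toy corpus, the stopword list, and the frequency threshold are all invented for illustration; a real system would need additional filtering, since frequent co-occurring words are not necessarily subjective:

```python
from collections import Counter

# Sketch of one bootstrapping round: auto-label sentences that contain a
# seed word as subjective, then promote words that occur frequently in
# those sentences into the seed set. All data here is illustrative.
SEEDS = {"love", "hate", "awesome", "terrible"}
STOPWORDS = {"i", "this", "is", "the", "a", "so"}

def bootstrap_round(sentences, seeds, min_count=2):
    counts = Counter()
    for sent in sentences:
        words = set(sent.lower().split())
        if words & seeds:  # auto-label: the sentence contains a seed word
            counts.update(words - seeds - STOPWORDS)
    # Candidate new subjective words: frequent in auto-labeled sentences.
    new_words = {w for w, c in counts.items() if c >= min_count}
    return seeds | new_words  # expanded seed set for the next round

corpus = [
    "I love this phone",
    "this phone is awesome",
    "I hate the battery",
    "the battery is terrible",
]
print(bootstrap_round(corpus, SEEDS))
```

Repeating `bootstrap_round` with the expanded set is exactly the "expansion and repetition" loop described above.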
A related distinction is between lexicon-based approaches and statistical approaches, though
this is less of a theoretical distinction than a reflection of common system implementations. In a
lexicon-based approach, there is a dictionary of subjective or polar words, usually associated with
numerical scores indicating the strength of the word polarity. For example, the scores may range from
-5 to +5, with -5 indicating very negative sentiment (e.g., “terrible”) and +5 very positive sentiment
(e.g., “awesome”). Given a text, a lexicon-based system identifies words contained in its lexicon and
retrieves their word scores. A phrase, sentence or document can be scored, in the simplest case, by
summing over its sentiment word scores.
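This simplest scoring scheme can be sketched in a few lines. The tiny lexicon and its scores below are illustrative only, not drawn from any published resource:

```python
# Minimal sketch of lexicon-based scoring: sum the scores of all known
# sentiment words in a text. Lexicon entries here are illustrative.
LEXICON = {
    "awesome": 5, "love": 4, "good": 2,
    "bad": -2, "hate": -4, "terrible": -5,
}

def score_text(text):
    """Return the summed sentiment score of the lexicon words in text."""
    words = text.lower().split()
    return sum(LEXICON.get(w, 0) for w in words)

print(score_text("an awesome but terrible idea"))  # 5 + (-5) = 0
```

Even this toy version shows a useful property of summing: mixed texts tend toward neutral scores. The next paragraphs explain why a real system still needs much more than this.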
SO-Cal [Taboada et al., 2010] (for Semantic Orientation Calculator) is a lexicon-based system
that is considerably more sophisticated than that and illustrates why simply summing over word
scores is not sufficient. To give just one example, such a system must account for negators that
can weaken or reverse a word's dictionary score. The following three sentences help illustrate this
phenomenon:
1. I love this interface design.
2. I don't love this interface design.
3. I hate this interface design.
It seems clear that Sentence 1 is very positive and Sentence 3 is very negative, as indicated
by the words love and hate, respectively. However, Sentence 2 also contains the word love. Based
solely on the dictionary scores for the sentiment words, this sentence should therefore be considered
positive as well. Of course, we know that the preceding word don't negates that positive sentiment,
and any system will need to account for this effect. However, if we simply reverse the sentence score
due to the presence of the negator, we will end up assigning a very negative score similar to the score
for Sentence 3. Intuitively, it seems that Sentence 2 is more ambivalent than Sentence 3 and should
have more of a neutral score. For that reason, systems like SO-Cal make more subtle adjustments
to a sentence score when a negator is present.
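One such adjustment is a polarity shift: instead of multiplying a negated word's score by -1, the score is shifted toward the opposite polarity by a fixed amount. SO-Cal's actual rules are considerably more elaborate; the sketch below uses a shift of 4, a two-word lexicon, and whitespace tokenization, all of which are simplifications for illustration:

```python
# Sketch of the polarity-shift treatment of negation: a negated word's
# score is shifted toward the opposite polarity rather than reversed.
# The shift value, lexicon, and negator list are illustrative.
LEXICON = {"love": 4, "hate": -4}
NEGATORS = {"not", "don't", "never"}
SHIFT = 4

def score_sentence(sentence):
    score = 0
    tokens = sentence.lower().split()
    for i, tok in enumerate(tokens):
        word_score = LEXICON.get(tok)
        if word_score is None:
            continue
        if i > 0 and tokens[i - 1] in NEGATORS:
            # Shift toward the opposite polarity instead of flipping sign.
            word_score += SHIFT if word_score < 0 else -SHIFT
        score += word_score
    return score

print(score_sentence("I love this interface design"))        # 4
print(score_sentence("I don't love this interface design"))  # 0
print(score_sentence("I hate this interface design"))        # -4
```

Under this scheme, Sentence 2 lands at a neutral 0 rather than mirroring Sentence 3's -4, matching the intuition described above.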
In contrast, many statistical systems do not rely on hand-crafted dictionaries, but rather automatically learn subjective terms or phrases from labeled or unlabeled data. One idea is to build a
list of subjective words by identifying the words that occur most frequently in text labeled as being
subjective, once stopwords have been removed. Other statistical systems never use an explicit list of
subjective or polar words, but rather extract raw lexical features such as unigrams and bigrams and
let the machine learning method automatically learn how those features correlate with the positive