analysis. We weight the features with schemes ranging from simple counts to tf-idf and tf-delta-idf, a sentiment-aware variant of tf-idf proposed in [30].
Opinion Target Extraction. We also provide a supervised approach for the extraction of opinion targets, which is applicable as long as the targets are explicitly mentioned within the quotations and can be localized by text anchors. Our approach first selects a set of candidates and then classifies each candidate with a binary classifier to predict whether it is the sought target. Each opinion target candidate is represented as a feature vector of the POS tags surrounding it in a window of two words before and two words after the candidate. If more than one candidate is classified as an opinion target, a multistage decision process prefers specific candidates over others: we check the context near each candidate and accept only candidates conforming to specific predefined POS patterns.
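As a concrete illustration, the Python sketch below shows one plausible reading of the POS-window representation and the binary candidate classifier. The STTS tag names, the toy examples, and the use of scikit-learn are our own illustrative assumptions, not details of the original system.

```python
# A minimal sketch of the POS-window candidate representation described above,
# assuming pre-tagged tokens (STTS tags) and a window of two words on each side.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def pos_window_features(pos_tags, idx, window=2):
    """POS tags of the two tokens before and after the candidate at idx."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        pos = idx + offset
        feats[f"pos[{offset:+d}]"] = pos_tags[pos] if 0 <= pos < len(pos_tags) else "PAD"
    return feats

# Toy training data: (POS tags of the sentence, candidate index, is-target label).
examples = [
    (["ART", "NN", "VAFIN", "ADJD"], 1, 1),   # the noun after the article is a target
    (["ART", "NN", "VAFIN", "ADJD"], 3, 0),   # the adjective is not a target
]
vec = DictVectorizer()
X = vec.fit_transform(pos_window_features(tags, i) for tags, i, _ in examples)
y = [label for _, _, label in examples]
clf = LogisticRegression().fit(X, y)  # binary target / non-target classifier
```

The multistage decision step, which filters the accepted candidates by predefined POS patterns, would then be applied on top of the classifier's positive predictions.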
1.4.4 Sentiment Features
Finding an appropriate representation of the data at hand is a crucial task, since performance depends not only on the chosen machine learning algorithms but also, to a large extent, on the selected features [12]. In this section we present the features explored for our sentiment analysis approach. Besides primitive features, we also exploit derived lexical and linguistic features. Following [37], we include position information for each feature: we encode whether a feature was calculated on the beginning, the middle, or the end of the text, or whether the entire text was considered.
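The sketch below illustrates this position encoding by prefixing each feature with the text region it was computed on. Splitting the text into equal thirds is our assumption; the chapter only states that beginning, middle, end, and the entire text are distinguished.

```python
def positional_features(tokens):
    """Count tokens separately per text region and over the entire text."""
    n = len(tokens)
    regions = {
        "begin": tokens[: n // 3],
        "middle": tokens[n // 3 : 2 * n // 3],
        "end": tokens[2 * n // 3 :],
        "entire": tokens,
    }
    feats = {}
    for region, toks in regions.items():
        for tok in toks:
            feats[f"{region}:{tok}"] = feats.get(f"{region}:{tok}", 0) + 1
    return feats

# The same token produces distinct features depending on where it occurs.
print(positional_features("der Film war anfangs zäh aber am Ende großartig".split()))
```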
Bag-of-Words. A standard representation of documents for natural language processing tasks is the bag-of-words model [29]. It represents a document as a vector of weighted terms from a dictionary. We built a dictionary of uni- and bigrams and calculated the idf and delta-idf values from German news articles covering three months of 2012, the same period used for creating the evaluation corpus described in Sect. 1.4.5. The unigram and bigram lexicons were limited to 10,000 entries each. We included bigrams because they encode word order, which adds meaningful sentence structure information to the feature vector representation; previous work shows that bigrams help in the task of sentiment analysis [51]. To compile a feature vector for a document, we remove stop words and lowercase and stem each term using Lucene's GermanAnalyzer.22 Then we weight each term (uni- or bigram) using one of four schemes: occurrence flag (0/1), tf (term frequency), tf-idf (term frequency × inverse document frequency), or tf-delta-idf (term frequency × delta inverse document frequency).
22 https://lucene.apache.org/core/3_6_2/api/all/org/apache/lucene/analysis/de/GermanAnalyzer.html
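To make the four schemes concrete, the sketch below computes all of them for a single term. It assumes the common delta-idf formulation, the log ratio of a term's document frequency in positive versus negative training documents; whether this matches [30] in every detail is not stated here, and the +0.5 smoothing and example counts are our own assumptions.

```python
# A minimal sketch of the four term-weighting schemes named above.
import math

def term_weights(tf, df, n_docs, df_pos, n_pos, df_neg, n_neg):
    """Return all four weights for one term; the +0.5 smoothing is an assumption."""
    occurrence = 1.0 if tf > 0 else 0.0                 # scheme 1: occurrence flag
    idf = math.log(n_docs / (1.0 + df))                 # standard idf
    # delta-idf: difference between the term's idf in positive and in negative
    # training documents; its sign indicates which polarity the term leans toward.
    delta_idf = math.log2((n_pos * (df_neg + 0.5)) / (n_neg * (df_pos + 0.5)))
    return {
        "occurrence": occurrence,                       # scheme 1
        "tf": float(tf),                                # scheme 2
        "tf-idf": tf * idf,                             # scheme 3
        "tf-delta-idf": tf * delta_idf,                 # scheme 4
    }

# e.g. a term appearing 3x in a document, in 120 of 10,000 news articles,
# and in 40 of 500 positive vs. 5 of 500 negative training documents:
print(term_weights(3, 120, 10_000, 40, 500, 5, 500))
```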