analysis. We weight the features with schemes ranging from simple counts to tf-idf and tf-delta-idf, a sentiment-aware variant of tf-idf proposed in [30].
Opinion Target Extraction. We also provide a supervised approach for the extraction of opinion targets, which is applicable as long as the targets are explicitly mentioned within the quotations and can be localized by text anchors. Our approach first selects a set of candidates and then classifies each candidate with a binary classifier to predict whether it is the sought target. Each opinion target candidate is represented as a feature vector of the POS tags surrounding it in a window of two words before and two words after the candidate. If more than one candidate is classified as an opinion target, a multistage decision process prefers specific candidates over others: we check the context near each candidate and accept only candidates conforming to specific predefined POS patterns.
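As a concrete illustration, the Python sketch below shows one plausible reading of the POS-window representation and the binary candidate classifier. The STTS tag names, the toy examples, and the use of scikit-learn are our own illustrative assumptions, not details of the original system.

```python
# A minimal sketch of the POS-window candidate representation described above,
# assuming pre-tagged tokens (STTS tags) and a window of two words on each side.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def pos_window_features(pos_tags, idx, window=2):
    """POS tags of the two tokens before and after the candidate at idx."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        pos = idx + offset
        feats[f"pos[{offset:+d}]"] = pos_tags[pos] if 0 <= pos < len(pos_tags) else "PAD"
    return feats

# Toy training data: (POS tags of the sentence, candidate index, is-target label).
examples = [
    (["ART", "NN", "VAFIN", "ADJD"], 1, 1),   # the noun after the article is a target
    (["ART", "NN", "VAFIN", "ADJD"], 3, 0),   # the adjective is not a target
]
vec = DictVectorizer()
X = vec.fit_transform(pos_window_features(tags, i) for tags, i, _ in examples)
y = [label for _, _, label in examples]
clf = LogisticRegression().fit(X, y)  # binary target / non-target classifier
```

The multistage decision step, which filters the accepted candidates by predefined POS patterns, would then be applied on top of the classifier's positive predictions.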
1.4.4 Sentiment Features
Finding an appropriate representation of the data at hand is a crucial task, since performance depends not only on the chosen machine learning algorithms but also, to a large extent, on the selected features [12]. In this section we present the features explored for our sentiment analysis approach. Besides primitive features, we also exploit derived lexical and linguistic features. Following [37], we include position information for each feature: we encode whether a feature was calculated on the beginning, the middle, or the end of the text, or whether the entire text was considered.
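The sketch below illustrates this position encoding by prefixing each feature with the text region it was computed on. Splitting the text into equal thirds is our assumption; the chapter only states that beginning, middle, end, and the entire text are distinguished.

```python
def positional_features(tokens):
    """Count tokens separately per text region and over the entire text."""
    n = len(tokens)
    regions = {
        "begin": tokens[: n // 3],
        "middle": tokens[n // 3 : 2 * n // 3],
        "end": tokens[2 * n // 3 :],
        "entire": tokens,
    }
    feats = {}
    for region, toks in regions.items():
        for tok in toks:
            feats[f"{region}:{tok}"] = feats.get(f"{region}:{tok}", 0) + 1
    return feats

# The same token produces distinct features depending on where it occurs.
print(positional_features("der Film war anfangs zäh aber am Ende großartig".split()))
```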
Bag-of-Words. A standard representation of documents for natural language processing tasks is the bag-of-words model [29]. It represents a document as a vector of weighted terms from a dictionary. We built a dictionary of uni- and bigrams and calculated the idf and delta-idf values from German news articles covering three months of 2012, the same period used for creating the evaluation corpus described in Sect. 1.4.5. The unigram and bigram lexicons were limited to 10,000 entries each. We included bigrams because they encode word order, which adds meaningful sentence structure information to the feature vector representation; previous work shows that bigrams help in the task of sentiment analysis [51]. To compile a feature vector for a document, we remove stop words and lowercase and stem each term using Lucene's GermanAnalyzer.22 Then we weight each term (uni- or bigram) using one of four schemes: occurrence flag (0/1), tf (term frequency), tf-idf (term frequency × inverse document frequency), or tf-delta-idf (term frequency × delta inverse document frequency).
22 https://lucene.apache.org/core/3_6_2/api/all/org/apache/lucene/analysis/de/GermanAnalyzer.html
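To make the four schemes concrete, the sketch below computes all of them for a single term. It assumes the common delta-idf formulation, the log ratio of a term's document frequency in positive versus negative training documents; whether this matches [30] in every detail is not stated here, and the +0.5 smoothing and example counts are our own assumptions.

```python
# A minimal sketch of the four term-weighting schemes named above.
import math

def term_weights(tf, df, n_docs, df_pos, n_pos, df_neg, n_neg):
    """Return all four weights for one term; the +0.5 smoothing is an assumption."""
    occurrence = 1.0 if tf > 0 else 0.0                 # scheme 1: occurrence flag
    idf = math.log(n_docs / (1.0 + df))                 # standard idf
    # delta-idf: difference between the term's idf in positive and in negative
    # training documents; its sign indicates which polarity the term leans toward.
    delta_idf = math.log2((n_pos * (df_neg + 0.5)) / (n_neg * (df_pos + 0.5)))
    return {
        "occurrence": occurrence,                       # scheme 1
        "tf": float(tf),                                # scheme 2
        "tf-idf": tf * idf,                             # scheme 3
        "tf-delta-idf": tf * delta_idf,                 # scheme 4
    }

# e.g. a term appearing 3x in a document, in 120 of 10,000 news articles,
# and in 40 of 500 positive vs. 5 of 500 negative training documents:
print(term_weights(3, 120, 10_000, 40, 500, 5, 500))
```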