one hand the news texts are regarded as objective information sources reporting
facts. They are clustered according to events by Neofonie GmbH (Sect. 1.2.2).
On the other hand, the system aims at identifying subjective parts in the form of
quotations (Sect. 1.2.4) and at determining the sentiment polarity of the expressed
statements (Sect. 1.2.5). In parallel to the pipeline, Neofonie GmbH assigns the
news articles to automatically identified abstract meta-topics, which connect
thematically related topics (Sect. 1.2.3). The graphical user interface is described
in Sect. 1.5, which also presents screenshots of the system.
1.2.1 Preprocessing
The initial step in the analysis of news articles is linguistic preprocessing. After
crawling and deduplicating the news articles, we first split the text of each article
into tokens and sentences. This is an important prerequisite for various linguistic tasks
such as part-of-speech (POS) tagging, chunking, and our quotation extraction
approach. The next step is to lemmatize all words of the text. Mapping words to their
canonical form allows us to look them up in dictionaries in later steps; this is how
we find verbs that introduce quotations. A POS tagger then assigns each word
of a text its part of speech. We use the output of the POS tagger at several points in
our system. For instance, we use POS information to compile feature vectors
for our supervised sentiment analysis approach. Finally, the system uses a lexicon-based
named entity recognition approach. We use the German version of Wikipedia (see footnote 7) for
identifying and linking named entities. The named entities contribute to the concept
vectors required for our topic detection and tracking approach (Sect. 1.2.2).
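
The first preprocessing steps (sentence splitting, tokenization, lemmatization, and the dictionary lookup for quotation verbs) can be illustrated with a minimal sketch. It is not the system's implementation: the regular expressions are a naive stand-in for a real tokenizer, and the names LEMMAS, QUOTATION_VERBS, and quotation_verbs as well as the example entries are hypothetical.

```python
import re

# Hypothetical lemma dictionary and quotation-verb lexicon (illustrative entries only).
LEMMAS = {"sagte": "sagen", "sagt": "sagen", "erklärte": "erklären", "betonte": "betonen"}
QUOTATION_VERBS = {"sagen", "erklären", "betonen"}


def split_sentences(text):
    # Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)


def lemmatize(token):
    # Fall back to the lowercased token when the dictionary has no entry.
    return LEMMAS.get(token.lower(), token.lower())


def quotation_verbs(text):
    # Return (sentence index, surface form, lemma) for every quotation verb found.
    hits = []
    for index, sentence in enumerate(split_sentences(text)):
        for token in tokenize(sentence):
            lemma = lemmatize(token)
            if lemma in QUOTATION_VERBS:
                hits.append((index, token, lemma))
    return hits
```

Looking up the lemma rather than the surface form means that one lexicon entry such as "sagen" covers all inflected forms listed in the lemma dictionary.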
1.2.2 Topic Detection and Tracking
The Topic Detection and Tracking (TDT) program defines an event as “something that
happens at some specific time and place along with all necessary preconditions and
unavoidable consequences”. Topics (or stories) comprise a triggering event and all
directly related events and activities [17]. As news stories evolve over time, the task
of TDT approaches is either to identify news articles starting a topic or to assign news
articles to existing topics. In our system we employ an incremental agglomerative
clustering approach for TDT. We represent each news document as a vector of
concepts including named entities. Each cluster represents a topic and is specified by a
centroid vector with averaged concept weights of the covered news articles.
Incoming news documents are compared to the centroid vectors of all existing topics. If
7 http://de.wikipedia.org/
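
The centroid-based comparison described in Sect. 1.2.2 can be sketched as follows. This is a simplified single-pass illustration under stated assumptions, not the system's implementation: the text does not name the similarity measure or the assignment rule, so cosine similarity, the threshold value of 0.5, and the rule "open a new topic when no centroid is similar enough" are all assumptions, as are the names Topic, cosine, and assign.

```python
import math


def cosine(a, b):
    # Cosine similarity between two sparse concept vectors (dict: concept -> weight).
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Topic:
    def __init__(self, vector):
        self.size = 1
        self.centroid = dict(vector)

    def add(self, vector):
        # Update the centroid as the running average of all member vectors.
        n = self.size
        for concept in set(self.centroid) | set(vector):
            old = self.centroid.get(concept, 0.0)
            self.centroid[concept] = (old * n + vector.get(concept, 0.0)) / (n + 1)
        self.size = n + 1


def assign(topics, vector, threshold=0.5):
    # Compare the incoming document to every centroid and keep the best match.
    best_index, best_similarity = None, 0.0
    for index, topic in enumerate(topics):
        similarity = cosine(vector, topic.centroid)
        if similarity > best_similarity:
            best_index, best_similarity = index, similarity
    # Assumed rule: join the best topic if it is similar enough, else start a new one.
    if best_index is not None and best_similarity >= threshold:
        topics[best_index].add(vector)
        return best_index
    topics.append(Topic(vector))
    return len(topics) - 1
```

Storing the cluster size alongside the centroid lets each new article update the averaged concept weights without revisiting the articles already assigned to the topic.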