Fig. 1. Overview of the processes involved in our automatic topic detection approach
is considered as a concept term furthermore. In case no article is found, the substantive
is not considered a concept term, as it probably does not provide conceptual information
(like the term “doing”). In addition, auxiliary verbs such as “having” are excluded in
the first place by filtering all concept terms against a predefined stopword list.
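The filtering just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `STOPWORDS`, `ARTICLE_TITLES`, `has_article`, and `filter_concept_terms` are hypothetical names, and the tiny title set merely stands in for a real lookup against Wikipedia articles.

```python
# Hypothetical stand-ins (assumptions): a tiny stopword list and a set of
# known article titles replacing a real Wikipedia lookup.
STOPWORDS = {"having", "being"}
ARTICLE_TITLES = {"fan", "bayern munich", "football"}


def has_article(term: str) -> bool:
    """Return True if an article exists for the term (toy lookup)."""
    return term.lower() in ARTICLE_TITLES


def filter_concept_terms(candidates: list[str]) -> list[str]:
    """Keep candidates that are not stopwords and that have an article."""
    kept = []
    for term in candidates:
        if term.lower() in STOPWORDS:
            continue  # excluded in the first place by the stopword list
        if not has_article(term):
            continue  # no article found: probably no conceptual information
        kept.append(term)
    return kept
```

Called with the candidates "fan", "having", "doing", and "Bayern Munich", the sketch drops "having" as a stopword and "doing" for lack of an article.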
In order to detect named entities consisting of more than one word, adjectives and/or
nouns, and proper nouns appearing successively are tested for their lexical “togetherness”.
To this end, we make use of the concept information provided by Wikipedia in
terms of single articles [13]. More precisely, each of these potential named entities is
mapped onto the set of all Wikipedia articles $A_{wiki}$ twice: once as a whole and once
noun-wise. This mapping process is accomplished via a mapping function

$$f : \mathit{cterm} \rightarrow A_{wiki} \qquad (1)$$

where cterm is either the potential named entity or a single noun. To realize f, we
built up an Apache Lucene [14] search index containing documents for every Wikipedia
article including information about their titles, textual descriptions, textual anchors of
their incoming links, and redirects. This allows us to estimate both mappings by means
of the Lucene similarity score

$$\mathit{score}(q,d) = \sum_{t \in q} \big( tf(t \in d) \cdot idf(t) \cdot b_f \cdot n(q,d) \big) \qquad (2)$$

where $tf(t \in d)$ specifies the term frequency of each term $t \in \mathit{cterm}$ in $d$, $idf(t)$
indicates the general importance of $t$ within all documents, $b_f$ refers to the field boost
in case of an exact match of cterm in the article title, and $n(q,d)$ combines Lucene-internal
normalization factors. The outcome providing the better result determines the
final composition of the concept term. By this, Wikipedia is acting as a concept identifier.
As a result of the conceptualization step, a set of concept terms providing the
basis for the automatic detection of potential dialog topics is determined. Thus, for the
utterance “Ah, then you are a fan of Bayern Munich?” the concept terms “fan” and
“Bayern Munich” are specified.
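To make the whole-versus-noun-wise decision concrete, the following sketch scores a candidate against a toy article collection using a simplified form of Eq. (2). Everything in it is an assumption for illustration: the three articles, the boost value `TITLE_BOOST`, the simplified tf, idf, and normalization factors (which are not Lucene's exact formulas), and in particular the averaging of the noun-wise scores, since the text does not state how the two outcomes are compared.

```python
import math

# Toy article collection standing in for A_wiki (hypothetical data).
ARTICLES = {
    "Bayern Munich": "bayern munich is a german football club based in munich",
    "Munich": "munich is the capital of bavaria",
    "Bavaria": "bavaria is a state in germany and bayern is its german name",
}
TITLE_BOOST = 3.0  # assumed value for the field boost b_f


def tokens(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def idf(term):
    """Simplified inverse document frequency over the toy collection."""
    df = sum(1 for body in ARTICLES.values() if term in tokens(body))
    return 1.0 + math.log(len(ARTICLES) / (df + 1))


def score(cterm, title):
    """Eq. (2) with simplified factors: sum over t of tf * idf * b_f * n."""
    body = tokens(ARTICLES[title])
    b_f = TITLE_BOOST if cterm.lower() == title.lower() else 1.0
    n = 1.0 / math.sqrt(len(body))  # stand-in for n(q, d)
    return sum(body.count(t) * idf(t) * b_f * n for t in tokens(cterm))


def f(cterm):
    """Map cterm to its best-scoring article, returning (title, score)."""
    return max(((t, score(cterm, t)) for t in ARTICLES), key=lambda p: p[1])


def compose(candidate):
    """Keep the candidate whole if it scores at least as well as its parts."""
    whole = f(candidate)[1]
    parts = [f(word)[1] for word in tokens(candidate)]
    noun_wise = sum(parts) / len(parts)  # assumed aggregation (mean)
    return [candidate] if whole >= noun_wise else candidate.split()
```

With this toy data, `compose("Bayern Munich")` keeps the candidate as one concept term because the exact title match earns the boost, whereas `compose("munich bavaria")` splits it, since each word matches its own article title better than the pair matches any single article. A real index with separate fields for titles, descriptions, anchors, and redirects would replace the single text body used here.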
 