current prominent information extraction methods advocate deeper NLP components
for concept and relation extraction, e.g., syntactic and semantic dependency analysis
of complete sentences and the integration of rich linguistic knowledge bases like
WordNet.
The paper is organized as follows. In section 2 we briefly summarize the topic
graph extraction process. 1 For the sake of completeness and readability, we present in
section 3 details and examples of the user interfaces for the iPad and iPhone,
respectively.
A major limitation of the topic graph extraction process described in section 2 is its
purely syntactic nature. Consequently, in section 4, we introduce a semantic clustering
approach that helps to improve the quality of the extracted topics. The next sections then
describe details of the evaluation of the improved topic extraction process (section 5),
and present our current user experience for the iPad and iPhone user interfaces (section
6). Related work is discussed in section 7, before we conclude the paper in section 8.
2 Topic-Driven Exploration of Web Content
The core idea is to compute a set of chunk-pair-distance elements for the first N web
snippets returned by a search engine for the topic Q, and to compute the topic graph
from these elements. 2 In general, for two chunks, a single chunk-pair-distance element
stores the distance between the chunks by counting the number of chunks in-between
them. We distinguish elements which have the same words in the same order, but have
different distances. For example, (Justin, Selina, 5) is different from (Justin, Selina, 2)
and (Selina, Justin, 7).
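To make the distinction concrete, the following minimal Python sketch represents a chunk-pair-distance element as an immutable triple. The class name ChunkPairDistance, its field names, and the use of a Counter to tally repeated observations are illustrative assumptions, not part of the described system.

```python
from collections import Counter
from typing import NamedTuple


class ChunkPairDistance(NamedTuple):
    """One chunk-pair-distance element (hypothetical representation)."""
    left: str      # chunk that occurs first in the snippet
    right: str     # chunk that occurs later in the snippet
    distance: int  # number of chunks in-between the two


# The three elements from the text, plus one repeated observation.
observed = [
    ChunkPairDistance("Justin", "Selina", 5),
    ChunkPairDistance("Justin", "Selina", 2),
    ChunkPairDistance("Selina", "Justin", 7),
    ChunkPairDistance("Justin", "Selina", 5),
]

# Elements are equal only if words, order, and distance all agree,
# so the counter keeps three separate keys.
counts = Counter(observed)
```

Because tuples compare field by field, the same two words with a different order or a different distance end up under different keys, which mirrors the distinction made above.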
Initially, a document is created from selected web snippets so that each line contains
a complete snippet. Each of these lines is then tagged with part-of-speech (PoS) labels
using the SVMTagger [8] and chunked in the next step.
The chunker recognizes two types of word chains: noun chunks and verb chunks.
Each recognized word chain consists of the longest matching sequences of words with
the same PoS class, namely noun chains or verb chains, where an element of a noun
chain belongs to one of the predefined extended noun tags. Elements of a verb chain
only contain verb tags. For English, “word/PoS” expressions that match the regular
expression “/(N(N|P))|/VB(N|G)|/IN|/DT” are considered extended noun tags, and for
German those that match the expression “/(N(N|E))|/VVPP|/AP|/ART”. English verbs are
those whose PoS tag starts with VB (VV in the case of German). We use the tag sets from
the Penn treebank (English) and the Negra treebank (German).
The chunk-pair-distance model is computed from the list of noun group chunks. 3
This is done by traversing the chunks from left to right. For each chunk c_i, a set is
computed by considering all remaining chunks and their distance to c_i, i.e.,
1 This part of the work has partially been presented in [12] and hence will be described and
illustrated compactly.
2 We are using Bing (http://www.bing.com/) for web search with N set to max. 1000.
3 The main purpose of recognizing verb chunks is to improve proper recognition of noun groups.
They are ignored when building the topic graph, but see sec. 8.
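The chunking and pairing steps of section 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the input line is already tagged in "word/PoS" form with Penn treebank tags; the extended noun pattern is checked before the verb prefix, so VBN and VBG count as noun-chain elements; and the distance is the number of noun chunks lying between two chunks, so adjacent chunks get 0. The function names tag_class, chunk, and chunk_pair_distances and the example sentence are hypothetical.

```python
import re
from itertools import groupby

# English extended noun tags, as read from the text of section 2.
EXTENDED_NOUN_TAG = re.compile(r"/(N(N|P))|/VB(N|G)|/IN|/DT")


def tag_class(token):
    """Classify one "word/PoS" token as "noun", "verb", or None."""
    _, _, pos = token.rpartition("/")
    # Assumption: the noun pattern takes priority over the VB prefix.
    if EXTENDED_NOUN_TAG.match("/" + pos):
        return "noun"
    if pos.startswith("VB"):
        return "verb"
    return None


def chunk(tagged_line):
    """Group maximal runs of same-class tokens into (class, text) chunks."""
    chunks = []
    for cls, run in groupby(tagged_line.split(), key=tag_class):
        if cls is not None:  # tokens of any other class only break chains
            words = [token.rpartition("/")[0] for token in run]
            chunks.append((cls, " ".join(words)))
    return chunks


def chunk_pair_distances(noun_chunks):
    """Pair each chunk with all chunks to its right, left to right."""
    elements = []
    for i, left in enumerate(noun_chunks):
        for j in range(i + 1, len(noun_chunks)):
            # Assumption: distance = number of noun chunks in-between.
            elements.append((left, noun_chunks[j], j - i - 1))
    return elements


line = ("the/DT singer/NN Justin/NNP met/VBD Selina/NNP "
        "yesterday/RB ,/, fans/NNS said/VBD ./.")
chunks = chunk(line)
# Verb chunks only help delimit noun groups; they are dropped here.
nouns = [text for cls, text in chunks if cls == "noun"]
elements = chunk_pair_distances(nouns)
```

For the example line this yields the noun chunks "the singer Justin", "Selina", and "fans", and the elements ("the singer Justin", "Selina", 0), ("the singer Justin", "fans", 1), and ("Selina", "fans", 0). If adjacent chunks should instead have distance 1, only the expression j - i - 1 needs to change.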