(Here, F(x) denotes the F-measure achieved by the PoS tagger on a subset of 100 snippets comprising x words in total.)
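(The underlying measure is presumably the standard balanced F-measure, i.e. the harmonic mean of the tagger's precision P and recall R: F = 2·P·R / (P + R).)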
Fukushima. This corpus represents snippets mainly coming from official online news
magazines. The corpus statistics are as follows:
#s    #sc   #si   #w     F(2956)
240   195   182   6770   93.20%
Justin Bieber. This corpus represents snippets coming from celebrity magazines or
gossip forums. The corpus statistics are:
#s    #sc   #si   #w     F(3208)
240   250   160   6420   92.08%
New York. This corpus represents snippets coming from different official and private
homepages, as well as from news magazines. The corpus statistics are:
#s    #sc   #si   #w     F(3405)
239   318   129   6441   92.39%
This means that 39% of all tagged sentences were incomplete and that the performance of the PoS tagger decreased by about 5% F-measure (compared to the reported 97.4% on newspaper text). Consequently, a number of chunks are recognized incorrectly. For example, it turned out that date expressions are systematically tagged as nouns, so that they are covered by our noun chunk recognizer although they should not be (cf. section 2). Furthermore, the genitive possessive (the “'s” as in “Japan's president”) was systematically misclassified, which also had a negative effect on the performance of the noun chunker. Very often, nouns were incorrectly tagged as verbs because of erroneously identified punctuation. Thus, we need a filtering mechanism that is able to identify and remove the wrongly chunked topic-pairs.
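To make these figures concrete, the following sketch aggregates the three tables above, under the assumption (not stated in this excerpt) that #sc and #si count the complete and incomplete sentences per corpus:

```python
# Back-of-the-envelope check of the figures reported above; assumes that
# #sc / #si denote complete / incomplete sentences (an assumption, since
# the column legend is not part of this excerpt).
corpora = {
    "Fukushima":     {"sc": 195, "si": 182, "F": 93.20},
    "Justin Bieber": {"sc": 250, "si": 160, "F": 92.08},
    "New York":      {"sc": 318, "si": 129, "F": 92.39},
}

incomplete = sum(c["si"] for c in corpora.values())
total = sum(c["sc"] + c["si"] for c in corpora.values())
print(f"incomplete sentences: {incomplete / total:.1%}")   # -> 38.2%, i.e. ~39%

avg_f = sum(c["F"] for c in corpora.values()) / len(corpora)
print(f"mean F-measure: {avg_f:.2f}%, drop vs. 97.4%: {97.4 - avg_f:.1f} points")
```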
Semantic Filtering of Noisy Chunk Pairs
A promising algorithmic solution to this problem is provided by the online clustering system Carrot2 [14], which computes sensible descriptions of clustered search results (i.e., web documents). The Carrot2 system is based on the Lingo algorithm [13]. Most algorithms for clustering open text follow a kind of “document-comes-first” strategy: the input documents are clustered first and then, based on these clusters, the descriptive terms or labels of the clusters are determined, cf. [6]. The Lingo algorithm reverses this strategy by following a three-step “description-comes-first” strategy (cf. [13] for more details): 1) extraction of frequent terms from the input documents, 2) reduction of the (pre-computed) term-document matrix using Singular Value Decomposition (SVD) to identify latent structure in the search results, and 3) assignment of relevant documents to the identified labels.
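To illustrate the three steps, here is a much-simplified, hypothetical sketch in Python (using numpy and scikit-learn, which are not part of Carrot2; the real Lingo algorithm extracts frequent phrases and scores label quality before assignment, cf. [13]):

```python
# A simplified, illustrative sketch of the "description-comes-first" idea,
# NOT the actual Carrot2/Lingo implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def description_comes_first(snippets, n_labels=3):
    # Step 1: extract candidate label terms from the input documents
    # (here: plain uni-/bigrams; Lingo uses frequent phrase extraction).
    vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    A = vec.fit_transform(snippets).T.toarray()   # term-document matrix
    terms = vec.get_feature_names_out()

    # Step 2: SVD of the term-document matrix; each left-singular vector
    # captures one abstract "topic" latent in the search results. The
    # candidate term loading highest on a topic becomes its label.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    k = min(n_labels, U.shape[1])
    labels = [terms[np.argmax(np.abs(U[:, i]))] for i in range(k)]

    # Step 3: assign each document to the topic/label it loads on most.
    assignment = np.abs(U[:, :k].T @ A).argmax(axis=0)
    return labels, assignment

snippets = ["earthquake hits the Fukushima plant",
            "Fukushima nuclear disaster report",
            "Justin Bieber announces concert tour",
            "gossip about Justin Bieber's tour"]
print(description_comes_first(snippets))
```

The reversal is visible in the sketch: the labels are derived from the term side (the left-singular vectors of the term-document matrix) before any document is assigned to a cluster.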
The specific strategy behind the Lingo algorithm matches our needs for finding meaningful semantic filters very well: we basically use steps 1) and 2) to compute a set of