(Here, F(x) denotes the F-measure achieved by the PoS tagger on a subset of 100 snippets comprising x words in total.)
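(The underlying measure is presumably the standard balanced F-measure, i.e. the harmonic mean of the tagger's precision P and recall R: F = 2·P·R / (P + R).)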
Fukushima. This corpus represents snippets mainly coming from official online news
magazines. The corpus statistics are as follows:
#s    #sc   #si   #w     F(2956)
240   195   182   6770   93.20%
Justin Bieber. This corpus represents snippets coming from celebrity magazines or
gossip forums. The corpus statistics are:
#s    #sc   #si   #w     F(3208)
240   250   160   6420   92.08%
New York. This corpus represents snippets coming from different official and private
homepages, as well as from news magazines. The corpus statistics are:
#s    #sc   #si   #w     F(3405)
239   318   129   6441   92.39%
This means that 39% of all tagged sentences were incomplete and that the performance of the PoS tagger decreased by about 5% F-measure (compared to the reported 97.4% on newspaper text). Consequently, a number of chunks are recognized incorrectly. For example, it turned out that date expressions are systematically tagged as nouns, so that they are covered by our noun chunk recognizer although they should not be (cf. section 2). Furthermore, the genitive possessive (the “'s” as in “Japan's president”) was systematically misclassified, which also had a negative effect on the performance of the noun chunker. Very often, nouns were incorrectly tagged as verbs because of erroneously identified punctuation. Thus, we need a filtering mechanism that is able to identify and remove the wrongly chunked topic-pairs.
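To make these figures concrete, the following sketch aggregates the three tables above, under the assumption (not stated in this excerpt) that #sc and #si count the complete and incomplete sentences per corpus:

```python
# Back-of-the-envelope check of the figures reported above; assumes that
# #sc / #si denote complete / incomplete sentences (an assumption, since
# the column legend is not part of this excerpt).
corpora = {
    "Fukushima":     {"sc": 195, "si": 182, "F": 93.20},
    "Justin Bieber": {"sc": 250, "si": 160, "F": 92.08},
    "New York":      {"sc": 318, "si": 129, "F": 92.39},
}

incomplete = sum(c["si"] for c in corpora.values())
total = sum(c["sc"] + c["si"] for c in corpora.values())
print(f"incomplete sentences: {incomplete / total:.1%}")   # -> 38.2%, i.e. ~39%

avg_f = sum(c["F"] for c in corpora.values()) / len(corpora)
print(f"mean F-measure: {avg_f:.2f}%, drop vs. 97.4%: {97.4 - avg_f:.1f} points")
```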
Semantic Filtering of Noisy Chunk Pairs
A promising algorithmic solution to this problem is provided by the online clustering system Carrot2 [14], which computes sensible descriptions of clustered search results (i.e., web documents). The Carrot2 system is based on the Lingo algorithm [13]. Most algorithms for clustering open text follow a kind of “document-comes-first” strategy: the input documents are clustered first and then, based on these clusters, the descriptive terms or labels of the clusters are determined, cf. [6]. The Lingo algorithm reverses this strategy by following a three-step “description-comes-first” strategy (cf. [13] for more details): 1) extraction of frequent terms from the input documents, 2) reduction of the (pre-computed) term-document matrix using Singular Value Decomposition (SVD) to identify latent structure in the search results, and 3) assignment of relevant documents to the identified labels.
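To illustrate the three steps, here is a much-simplified, hypothetical sketch in Python (using numpy and scikit-learn, which are not part of Carrot2; the real Lingo algorithm extracts frequent phrases and scores label quality before assignment, cf. [13]):

```python
# A simplified, illustrative sketch of the "description-comes-first" idea,
# NOT the actual Carrot2/Lingo implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def description_comes_first(snippets, n_labels=3):
    # Step 1: extract candidate label terms from the input documents
    # (here: plain uni-/bigrams; Lingo uses frequent phrase extraction).
    vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    A = vec.fit_transform(snippets).T.toarray()   # term-document matrix
    terms = vec.get_feature_names_out()

    # Step 2: SVD of the term-document matrix; each left-singular vector
    # captures one abstract "topic" latent in the search results. The
    # candidate term loading highest on a topic becomes its label.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    k = min(n_labels, U.shape[1])
    labels = [terms[np.argmax(np.abs(U[:, i]))] for i in range(k)]

    # Step 3: assign each document to the topic/label it loads on most.
    assignment = np.abs(U[:, :k].T @ A).argmax(axis=0)
    return labels, assignment

snippets = ["earthquake hits the Fukushima plant",
            "Fukushima nuclear disaster report",
            "Justin Bieber announces concert tour",
            "gossip about Justin Bieber's tour"]
print(description_comes_first(snippets))
```

The reversal is visible in the sketch: the labels are derived from the term side (the left-singular vectors of the term-document matrix) before any document is assigned to a cluster.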
The specific strategy behind the Lingo algorithm matches our needs for finding meaningful semantic filters very well: we basically use steps 1) and 2) to compute a set of