Information Technology Reference
In-Depth Information
Fig. 4. The alternative representation of the
topic graph displayed in Fig. 1 on the iPhone
Fig. 5. The snippets after touching the item
“fukushima daiichi”
initial manual analysis of a small sample revealed, that the extracted chunks sometimes
are either incomplete or simply wrong. Consequently, this also caused the “readability”
of the resulting topic graph due to “meaningless” relationships. Note that the decreased
quality of PoS tagging is not only caused by the different style of the “snippet lan-
guage”, but also because PoS taggers are usually trained on linguistically more well-
formed sources like newspaper articles (which is also the case for our PoS tagger in use
which reports an F-measure of 97.4% on such text style).
Nevertheless, we want to benefit from PoS tagging during chunk recognition in order
to be able to identify, on the fly, a shallow phrase structure in web snippets with minimal
efforts. In order to tackle this dilemma, investigations into additional semantical-based
filtering seems to be a plausible way to go.
About the Performance of Chunking Web Snippets
As an initial phase into this direction we collected three different corpora of web snip-
pets and analysed them according to the amount of well-formed sentences and incom-
plete sentences contained in the web snippets. Furthermore, we also randomly selected
a subset of 100 snippets from each corpus and manually evaluated the quality of the
PoS tagging result. The snippet corpora and results of our analysis are as follows (the
shortcuts mean: #s = number of snippets retrieved, #sc = well-formed sentences within
the set of snippets, #si = incomplete sentences within the snippets, #w = number of
 
Search WWH ::




Custom Search