In [7], the tested sections did not include tables, figure captions or table captions. Figures have been dealt with by [10] and by our own work [11]. Table contents, however, are a completely new topic and, to our knowledge, have not been researched yet.
1.2 Background
The basic problem of handling PDF documents is that the text information is not
freely available. While an HTML file stripped of its tags usually delivers legible
text, even the simple task of text extraction from a PDF is rather complicated.
At its core, PDF is foremost a visual medium, describing for each glyph
(= character or picture) where it should be printed on the page [12]. Most PDF
converters simply emulate this glyph-by-glyph positioning in ASCII [13].
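As an illustration of this glyph-level view (a sketch of our own, using the pdfminer.six library as one possible extraction back-end; the file name is a placeholder), the following code recovers each character together with its page coordinates:

```python
# Sketch: recover glyphs and their page coordinates with pdfminer.six.
# The library choice and the file name are illustrative, not part of the cited work.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

def extract_glyphs(pdf_path):
    """Yield (page number, character, bounding box) for every glyph."""
    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        # bbox = (x0, y0, x1, y1) in page coordinates
                        yield page_no, obj.get_text(), obj.bbox

for page_no, char, bbox in extract_glyphs("paper.pdf"):
    print(page_no, repr(char), bbox)
```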
Still, since the position of every glyph is known, the original layout can be
deduced and the semantic connections can be restored. For HTML, layout
information has successfully been used to improve the classification of web pages
[14]. We extract the same kind of layout information from PDF documents. This offers
us a multitude of possibilities, from restoring the original reading order,
through table recognition, to locating the images in the paper.
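As a toy illustration of how such positional information supports reading-order restoration (our own simplification, assuming a single-column page and line boxes such as those produced by the sketch above), text lines can simply be sorted top-to-bottom and left-to-right:

```python
# Naive reading-order sketch: sort line boxes top-to-bottom, then left-to-right.
# Assumes a single-column page; bbox = (x0, y0, x1, y1) with y growing upwards,
# as in PDF page coordinates.
def reading_order(lines):
    """lines: iterable of (text, bbox) tuples for one page."""
    return sorted(lines, key=lambda item: (-item[1][3], item[1][0]))

page_lines = [
    ("2.1 Methods", (72, 640, 160, 652)),
    ("Abstract", (72, 700, 130, 712)),
    ("We present ...", (72, 682, 300, 694)),
]
for text, _ in reading_order(page_lines):
    print(text)
```

Multi-column layouts would of course require column segmentation before such a sort.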
In the biomedical context, image retrieval is usually not applied to literature
retrieval but to searching large image databases. The general PicHunter
approach [15] is an example of such a content-based image retrieval system. Using
updated Bayesian formulas, the PicHunter framework has been adapted to
refine the results of a query by predicting the user's actions. This approach
addresses image retrieval in general and does not discuss image retrieval in
a biological context.
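The following is a minimal sketch of the Bayesian relevance-feedback idea behind such a system; the similarity measure and the softmax user model are placeholders of our own and not the formulas used in [15]:

```python
# Minimal sketch of Bayesian relevance feedback (not PicHunter's exact formulas).
# posterior[i] = P(image i is the user's target), updated after each selection.
import numpy as np

def update_posterior(posterior, features, shown_idx, picked_idx, temperature=1.0):
    """One feedback round: the user saw `shown_idx` and picked `picked_idx`."""
    likelihood = np.empty_like(posterior)
    for i, target in enumerate(features):
        # Hypothetical user model: the user tends to pick the shown image that
        # is most similar to the target, with softmax-distributed noise.
        sims = np.array([-np.linalg.norm(target - features[j]) for j in shown_idx])
        probs = np.exp(sims / temperature)
        probs /= probs.sum()
        likelihood[i] = probs[shown_idx.index(picked_idx)]
    posterior = posterior * likelihood          # Bayes rule: prior * likelihood
    return posterior / posterior.sum()          # renormalise

features = np.random.rand(100, 16)              # toy image descriptors
posterior = np.full(100, 1 / 100)               # uniform prior over targets
posterior = update_posterior(posterior, features, shown_idx=[3, 7, 42], picked_idx=7)
print(posterior.argmax(), posterior.max())
```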
The IRMA concept (Image Retrieval in Medical Applications) [16, 17] has
been developed to handle primitive and semantic queries and to browse medical
images with respect to medical applications. The approach is able to support
content understanding and highly differentiated queries on an abstract information
level. To account for the distinct smaller structures of a typical
medical image, local representations are used to categorize the entire image.
These local features are then compared with a k-nearest neighbour algorithm.
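A minimal sketch of the k-nearest-neighbour step is given below; the coarse grid of mean intensities is a toy stand-in for a local representation and not the features actually used by IRMA:

```python
# Sketch of k-NN categorisation over simple local features (toy stand-in for IRMA).
import numpy as np
from collections import Counter

def grid_features(image, grid=4):
    """Toy local representation: mean intensity of each cell in a grid x grid tiling."""
    h, w = image.shape
    cells = [image[i * h // grid:(i + 1) * h // grid,
                   j * w // grid:(j + 1) * w // grid].mean()
             for i in range(grid) for j in range(grid)]
    return np.array(cells)

def knn_classify(query, train_feats, train_labels, k=5):
    """Majority vote among the k training images closest to the query."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

train_imgs = np.random.rand(50, 64, 64)                              # toy "images"
train_labels = np.random.choice(["x-ray", "MRI", "ultrasound"], size=50)
train_feats = np.array([grid_features(img) for img in train_imgs])
print(knn_classify(grid_features(np.random.rand(64, 64)), train_feats, train_labels))
```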
Tables pose a completely different problem. Green and Krishnamoorthy [18]
developed a method that is capable of analyzing model-based tables. To make
this approach work, a model or template of the tables in question has to be pro-
vided. Zuyev [19] introduced an algorithm for table image segmentation that uses
table grids. Tables that use table lines can be identified well, but the approach
is not able to find tables without these separators.
Both methods have the problem that they rely heavily on table separators, such as
lines, or connectors, such as dots between the values. Unfortunately, these marks
do not come out clearly in vector-based documents; all we have there are the
characters and their positions. Moreover, not all tables use these separators
consistently. The T-Recs system [20] follows a bottom-up approach to table recognition.
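As a toy illustration of the bottom-up idea in general (our own simplification, not the T-Recs algorithm), word bounding boxes can be grouped into column candidates purely by horizontal overlap, using only the character positions that a vector-based document provides:

```python
# Toy bottom-up grouping: cluster word boxes into column candidates by
# horizontal overlap alone (no ruling lines needed). Not the T-Recs algorithm.
def x_overlap(a, b):
    """Horizontal overlap of two boxes (x0, y0, x1, y1)."""
    return min(a[2], b[2]) - max(a[0], b[0])

def group_columns(word_boxes):
    columns = []                      # each column is a list of word boxes
    for box in sorted(word_boxes, key=lambda b: b[0]):
        for col in columns:
            if any(x_overlap(box, other) > 0 for other in col):
                col.append(box)
                break
        else:
            columns.append([box])
    return columns

words = [(72, 700, 120, 712), (80, 680, 118, 692),    # two words, one column
         (200, 700, 260, 712), (205, 680, 255, 692)]  # two words, another column
print(len(group_columns(words)))      # -> 2 column candidates
```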
 