In [7], the tested sections did not include tables, figure captions or table captions. Figures have been dealt with by [10] and by our own work [11]. Table contents, however, are a completely new topic and, to our knowledge, have not been researched yet.
1.2 Background
The basic problem of handling PDF documents is that the text information is not
freely available. While an HTML file stripped of its tags usually delivers legible
text, even the simple task of text extraction from a PDF is rather complicated.
At its core, PDF is foremost a visual medium, describing for each glyph
(= character or picture) where it should be printed on the page [12]. Most PDF
converters simply emulate this glyph-by-glyph positioning in ASCII [13].
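As an illustration of this glyph-level view (a sketch of our own, using the pdfminer.six library as one possible extraction back-end; the file name is a placeholder), the following code recovers each character together with its page coordinates:

```python
# Sketch: recover glyphs and their page coordinates with pdfminer.six.
# The library choice and the file name are illustrative, not part of the cited work.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

def extract_glyphs(pdf_path):
    """Yield (page number, character, bounding box) for every glyph."""
    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        # bbox = (x0, y0, x1, y1) in page coordinates
                        yield page_no, obj.get_text(), obj.bbox

for page_no, char, bbox in extract_glyphs("paper.pdf"):
    print(page_no, repr(char), bbox)
```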
Still, since the position of every glyph is known, the original layout can be
deduced and the semantic connections can be restored. For HTML, layout
information has successfully been used to improve the classification of web pages
[14]. We extract the same kind of layout information from PDF documents. This offers
us a multitude of possibilities, from restoring the original reading order,
through table recognition, to locating the images in the paper.
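As a toy illustration of how such positional information supports reading-order restoration (our own simplification, assuming a single-column page and line boxes such as those produced by the sketch above), text lines can simply be sorted top-to-bottom and left-to-right:

```python
# Naive reading-order sketch: sort line boxes top-to-bottom, then left-to-right.
# Assumes a single-column page; bbox = (x0, y0, x1, y1) with y growing upwards,
# as in PDF page coordinates.
def reading_order(lines):
    """lines: iterable of (text, bbox) tuples for one page."""
    return sorted(lines, key=lambda item: (-item[1][3], item[1][0]))

page_lines = [
    ("2.1 Methods", (72, 640, 160, 652)),
    ("Abstract", (72, 700, 130, 712)),
    ("We present ...", (72, 682, 300, 694)),
]
for text, _ in reading_order(page_lines):
    print(text)
```

Multi-column layouts would of course require column segmentation before such a sort.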
In the biomedical context, image retrieval is usually not applied to literature
retrieval but to searching large image databases. The general PicHunter
approach [15] is an example of such a content-based image retrieval system. Using
updated Bayesian formulas, the PicHunter framework has been adapted to
refine the results of a query by predicting the user's actions. This approach
addresses image retrieval in general and does not discuss image retrieval in
a biological context.
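The following is a minimal sketch of the Bayesian relevance-feedback idea behind such a system; the similarity measure and the softmax user model are placeholders of our own and not the formulas used in [15]:

```python
# Minimal sketch of Bayesian relevance feedback (not PicHunter's exact formulas).
# posterior[i] = P(image i is the user's target), updated after each selection.
import numpy as np

def update_posterior(posterior, features, shown_idx, picked_idx, temperature=1.0):
    """One feedback round: the user saw `shown_idx` and picked `picked_idx`."""
    likelihood = np.empty_like(posterior)
    for i, target in enumerate(features):
        # Hypothetical user model: the user tends to pick the shown image that
        # is most similar to the target, with softmax-distributed noise.
        sims = np.array([-np.linalg.norm(target - features[j]) for j in shown_idx])
        probs = np.exp(sims / temperature)
        probs /= probs.sum()
        likelihood[i] = probs[shown_idx.index(picked_idx)]
    posterior = posterior * likelihood          # Bayes rule: prior * likelihood
    return posterior / posterior.sum()          # renormalise

features = np.random.rand(100, 16)              # toy image descriptors
posterior = np.full(100, 1 / 100)               # uniform prior over targets
posterior = update_posterior(posterior, features, shown_idx=[3, 7, 42], picked_idx=7)
print(posterior.argmax(), posterior.max())
```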
The IRMA concept (Image Retrieval in Medical Applications) [16, 17] has
been developed to handle primitive and semantic queries and to browse medical
images with respect to medical applications. The approach is able to support
content understanding and highly differentiated queries on an abstract information
level. To account for the distinct smaller structures of a typical
medical image, local representations are used to categorize the entire image.
These local features are then compared with a k-nearest neighbour algorithm.
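A minimal sketch of the k-nearest-neighbour step is given below; the coarse grid of mean intensities is a toy stand-in for a local representation and not the features actually used by IRMA:

```python
# Sketch of k-NN categorisation over simple local features (toy stand-in for IRMA).
import numpy as np
from collections import Counter

def grid_features(image, grid=4):
    """Toy local representation: mean intensity of each cell in a grid x grid tiling."""
    h, w = image.shape
    cells = [image[i * h // grid:(i + 1) * h // grid,
                   j * w // grid:(j + 1) * w // grid].mean()
             for i in range(grid) for j in range(grid)]
    return np.array(cells)

def knn_classify(query, train_feats, train_labels, k=5):
    """Majority vote among the k training images closest to the query."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

train_imgs = np.random.rand(50, 64, 64)                              # toy "images"
train_labels = np.random.choice(["x-ray", "MRI", "ultrasound"], size=50)
train_feats = np.array([grid_features(img) for img in train_imgs])
print(knn_classify(grid_features(np.random.rand(64, 64)), train_feats, train_labels))
```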
Tables pose a completely different problem. Green and Krishnamoorthy [18]
developed a method that is capable of analyzing model-based tables. To make
this approach work, a model or template of the tables in question has to be pro-
vided. Zuyev [19] introduced an algorithm for table image segmentation that uses
table grids. Tables that use table lines can be identified well, but the approach
is not able to find tables without these separators.
Both methods have the problem that they rely heavily on table separators, such as
lines, or connectors, such as dots between the values. Unfortunately, these marks
do not come out clearly in vector-based documents; all we have there are the
characters and their positions. Moreover, not all tables use these separators
consistently. The T-Recs system [20] follows a bottom-up approach to table recognition.
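As a toy illustration of the bottom-up idea in general (our own simplification, not the T-Recs algorithm), word bounding boxes can be grouped into column candidates purely by horizontal overlap, using only the character positions that a vector-based document provides:

```python
# Toy bottom-up grouping: cluster word boxes into column candidates by
# horizontal overlap alone (no ruling lines needed). Not the T-Recs algorithm.
def x_overlap(a, b):
    """Horizontal overlap of two boxes (x0, y0, x1, y1)."""
    return min(a[2], b[2]) - max(a[0], b[0])

def group_columns(word_boxes):
    columns = []                      # each column is a list of word boxes
    for box in sorted(word_boxes, key=lambda b: b[0]):
        for col in columns:
            if any(x_overlap(box, other) > 0 for other in col):
                col.append(box)
                break
        else:
            columns.append([box])
    return columns

words = [(72, 700, 120, 712), (80, 680, 118, 692),    # two words, one column
         (200, 700, 260, 712), (205, 680, 255, 692)]  # two words, another column
print(len(group_columns(words)))      # -> 2 column candidates
```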
 