Using Layout Data for the Analysis of Scientific Literature - Mining Complex Data


identify and segment tables in electronic or paper documents. The method is

independent from any table separators because only the words are considered.

This trait also makes it well suited to adaptation to vector-based documents.

1.3 Overview

To extract the layout and be able to process it, several steps are

necessary. First, we need the full-text literature, which is often hard to obtain due

to copyright issues. Since the full-text paper is usually available in PDF format,

we need to extract the text from it. The real difficulty in this step is that we need

to know where the text is on the page; otherwise we could simply use an ordinary

text extractor. The images must also be extracted, again with their positional

information intact. This information is handed over from the layout analyzer

to the next step (cf. Fig. 1.1) by storing it in an XML format we call ELL

(Enriched Layout Language).

Fig. 1.1. Workflow of the layout analysis
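
The idea of handing positioned text to the next step as XML can be illustrated with a minimal sketch. The ELL schema itself is not given in this section, so the element and attribute names below (document, page, text, x, y, width, height) and the TextBlock type are purely hypothetical; only the principle of keeping coordinates next to the extracted text is taken from the description above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A piece of extracted text with its position on the page (hypothetical type)."""
    page: int
    x: float
    y: float
    width: float
    height: float
    text: str


def blocks_to_xml(blocks):
    """Serialize positioned text blocks to XML.

    The element and attribute names are illustrative assumptions,
    not the actual ELL schema.
    """
    root = ET.Element("document")
    pages = {}
    for b in blocks:
        # Create one <page> element per page number, on first use.
        page_el = pages.get(b.page)
        if page_el is None:
            page_el = ET.SubElement(root, "page", number=str(b.page))
            pages[b.page] = page_el
        # Store the bounding box as attributes and the text as content.
        el = ET.SubElement(
            page_el, "text",
            x=str(b.x), y=str(b.y),
            width=str(b.width), height=str(b.height),
        )
        el.text = b.text
    return ET.tostring(root, encoding="unicode")
```

Images could be stored in the same way, for example as an element carrying a bounding box and a file reference, so that later steps see text and images in one coordinate system.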

The next steps are modularized. The ELL files are regrouped internally to

provide the correct reading order. Image and table extraction algorithms are also

applied. To provide a semantic annotation, the corresponding captions are located by a separate

algorithm. The results of these algorithms are indexed so that they can be presented in a web-based

search engine. To have a standardized interface, we created the

XML exchange format SIL (Search Interface Language). The SIL files are used

by the web platform CaptionSearch to provide a search interface to the user.

CaptionSearch allows multiple corpora to be handled separately. Also, different

users may be registered to different sets of corpora.
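
Two of the modular steps described above, regrouping into reading order and locating captions, can be sketched in a few lines. This is not the authors' algorithm: the column-based sort, the caption pattern, and all names (Block, reading_order, find_captions) are assumptions chosen only to illustrate how positional information makes both steps possible.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with its top-left corner; y grows downward (hypothetical type)."""
    x: float
    y: float
    text: str


# Assumed caption pattern: a block that starts with "Fig.", "Figure" or "Table"
# followed by a number.
CAPTION_RE = re.compile(r"^(Fig\.|Figure|Table)\s*\d+", re.IGNORECASE)


def reading_order(blocks, page_width, columns=2):
    """Order blocks column by column, top to bottom within each column.

    A simple heuristic for a fixed multi-column layout, used here only
    as a stand-in for the regrouping step.
    """
    col_width = page_width / columns

    def key(b):
        # Assign the block to a column by its left edge, clamped to the last column.
        col = min(int(b.x // col_width), columns - 1)
        return (col, b.y, b.x)

    return sorted(blocks, key=key)


def find_captions(blocks):
    """Return the blocks whose text looks like a figure or table caption."""
    return [b for b in blocks if CAPTION_RE.match(b.text.strip())]
```

For a two-column page, the blocks of the left column come first, followed by those of the right column; the caption candidates can then be paired with the nearest extracted image or table before indexing.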
