identify and segment tables in electronic or paper documents. The method is
independent of any table separators because only the words are considered.
This trait also makes it well suited to adaptation for vector-based documents.
1.3 Overview
To extract the layout and process it, several steps are necessary. First, we
need the full-text literature, which is often a problem due to copyright
issues. Since the full-text paper is usually available in PDF format, we need
to extract the text from there. The real difficulty in this step is that we
need to know where the text is on the page; otherwise we could simply use an
ordinary text extractor. The images must also be extracted, again with their
positional information intact. The layout analyzer hands this information over
to the next step (cf. Fig. 1.1) by storing it in an XML format we call ELL
(Enriched Layout Language).
Fig. 1.1. Workflow of the layout analysis
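To make the hand-over step more concrete, the following is a minimal sketch of
how positioned text and image blocks might be written to an ELL-like XML file.
The ELL schema is not given here, so the element and attribute names (ell, page,
text, image, x, y, width, height) and the Block class are assumptions chosen for
illustration only.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Block:
    """A positioned block on a page (hypothetical structure)."""
    kind: str        # "text" or "image"
    page: int        # page number, starting at 1
    x: float         # left edge of the bounding box
    y: float         # top edge of the bounding box
    width: float
    height: float
    content: str = ""  # text content; empty for images


def to_ell(blocks):
    """Serialize blocks to an ELL-like XML string (assumed schema)."""
    root = ET.Element("ell")
    pages = {}
    for b in blocks:
        # Create one <page> element per page number, on first use.
        page_el = pages.get(b.page)
        if page_el is None:
            page_el = ET.SubElement(root, "page", number=str(b.page))
            pages[b.page] = page_el
        # Keep the positional information as attributes of each block.
        el = ET.SubElement(
            page_el, b.kind,
            x=str(b.x), y=str(b.y),
            width=str(b.width), height=str(b.height),
        )
        el.text = b.content
    return ET.tostring(root, encoding="unicode")


example = [
    Block("text", 1, 72.0, 90.0, 200.0, 12.0, "Overview"),
    Block("image", 1, 72.0, 300.0, 240.0, 180.0),
]
print(to_ell(example))
```

Keeping the coordinates as attributes means that later modules can regroup or
filter blocks without reopening the PDF.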
The next steps are modularized. The ELL files are regrouped internally to
provide the correct reading order. Image and table extraction algorithms are
also applied. To provide a semantic annotation, the captions of the extracted
images and tables are located by a separate algorithm. The results of these
algorithms are indexed so that they can be presented in a web-based search
engine. To have a standardized interface, we created the XML exchange format
SIL (Search Interface Language). The SIL files are used by the web platform
CaptionSearch to provide a search interface to the user. CaptionSearch allows
multiple corpora to be handled separately, and different users may be
registered to different sets of corpora.
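The regrouping and caption-finding steps described above can be illustrated
with a small sketch. It is not the algorithm used in this work, which is not
specified here. It is a naive column-based heuristic, assuming blocks are given
as (x, y, text) tuples with the origin at the top left of the page, together
with a simple regular expression for caption-like lines. The names
reading_order, is_caption and CAPTION_RE are hypothetical.

```python
import re


def reading_order(blocks, page_width, columns=2):
    """Sort (x, y, text) blocks column by column, top to bottom.

    Naive heuristic: the page is split into equally wide columns and
    each block is assigned to a column by its left edge.
    """
    col_width = page_width / columns

    def key(block):
        x, y, _ = block
        # Clamp the column index so stray coordinates stay in range.
        col = max(0, min(int(x // col_width), columns - 1))
        return (col, y, x)

    return sorted(blocks, key=key)


# Matches lines that start like "Fig. 1.1. ...", "Figure 2: ..." or "Table 3 ...".
CAPTION_RE = re.compile(r"^(Fig\.|Figure|Table)\s+\d+(?:\.\d+)*[.:]?\s+\S")


def is_caption(text):
    """Return True if the text looks like a figure or table caption."""
    return CAPTION_RE.match(text) is not None


page = [
    (320.0, 100.0, "C"),
    (50.0, 400.0, "B"),
    (50.0, 100.0, "A"),
    (320.0, 400.0, "D"),
]
ordered = [text for _, _, text in reading_order(page, page_width=600.0)]
print(ordered)
print(is_caption("Fig. 1.1. Workflow of the layout analysis"))
```

A real layout would need to handle headings that span columns, footnotes, and
captions placed beside figures, none of which this heuristic covers.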