characters as suspicious if they do not seem to be English text. We started
with the language classifier Lingua::Identify by Jose Castro from CPAN, but
its results were not very stable: the library is not designed to identify text
that is not written in a natural language, and scientific papers contain many
words that do not look like English. We therefore implemented four simple
heuristics to detect conversion flaws. Each one marks either a font or a
contiguous piece of text as suspicious.
The first heuristic makes sure that the text in a font contains at least one
whitespace character. If there is not a single space in the text written in this
font, the font is marked as suspicious. As omitted spaces are a common conversion
problem, the second heuristic is similar: it searches for words with more than
20 letters, with the explicit exception of sequence data. The third marks possible
encoding errors, where a character without a proper encoding reference in the PDF
is converted into a letter-and-number code such as M150. To avoid matching special
names, this pattern must occur for at least two consecutive characters; single
characters, especially mathematical operators, therefore remain undetected.
Finally, the fourth heuristic marks all characters outside the 7-bit ASCII alphabet.
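The heuristics amount to little more than a few character tests per font or text run. The following Java sketch illustrates one way they could be written; the class and method names, the exact regular expressions (e.g. for the M150-style codes), and the omitted special handling of sequence data are our own assumptions, not the authors' implementation.

import java.util.regex.Pattern;

/** Illustrative sketch of the four conversion-flaw heuristics (names are assumed). */
public class ConversionFlawHeuristics {

    // Heuristic 2: words with more than 20 letters (sequence data would have to
    // be excluded beforehand; this sketch does not do that).
    private static final Pattern OVERLONG_WORD = Pattern.compile("\\p{L}{21,}");

    // Heuristic 3: letter-and-number codes such as "M150M151"; at least two
    // consecutive occurrences are required, so single operators are ignored.
    private static final Pattern ENCODING_CODES = Pattern.compile("(?:[A-Z]\\d{2,}){2,}");

    /** Heuristic 1: a font whose entire text contains no whitespace is suspicious. */
    static boolean lacksWhitespace(String fontText) {
        return fontText.chars().noneMatch(Character::isWhitespace);
    }

    /** Heuristic 2: suspicious if any word has more than 20 letters. */
    static boolean hasOverlongWord(String text) {
        return OVERLONG_WORD.matcher(text).find();
    }

    /** Heuristic 3: suspicious if unencoded characters appear as letter+number codes. */
    static boolean hasEncodingCodes(String text) {
        return ENCODING_CODES.matcher(text).find();
    }

    /** Heuristic 4: suspicious if any character lies outside the 7-bit ASCII range. */
    static boolean hasNonAscii(String text) {
        return text.chars().anyMatch(c -> c > 127);
    }
}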
The data cleansing step is quite fast and can process more than a document
per second. In a fairly recent test data set, 19 out of 605 documents (3.1%) were
filtered out: 10 of the documents did not contain enough text, 8 failed the quota,
and one included an improper symbol. All of these papers were unintelligible
from a text-extraction point of view. In older documents this rate can be higher,
as older converters tended to emphasize brevity over legibility.
1.3.5 CaptionSearch: The Web Application
As a simple demonstration of how the layout information can be used, we built a
web-based search engine. It is called CaptionSearch, since technically we do not
search for the images or tables themselves but for the caption text beneath them.
After image and table extraction has been done, the results are written to
an XML file in the SIL language. Each publication is represented by one file
containing a pdf element (lines 2-5), which may contain many img elements
(like the one in lines 3-5); each img element holds the link to the picture as an
attribute (line 3) and the caption as a nested element (lines 4-5).
1 <?xml version="1.0" encoding="iso-8859-2"?>
2 <pdf src="10094677.pdf">
3 <img src="pics/10094677.Im4.jpg">
4 <caption>FIG. 4. DNase I footprint analysis of ...
5 </caption></img></pdf>
Alternatively, there may be a text to show if the image is not available. That
way we can also use SIL as a regular search engine. For the indexing itself, we
use the Lucene package [21], which offers fast, Java-based indexing as well as
additional functionality such as a built-in query parser and several so-called
analyzers that allow us to vary how exactly the captions are indexed and what
defines a term.
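The indexing step could look roughly like the following sketch. It uses a recent Lucene API, so the exact classes may differ from the version cited in [21], and the field names (pdf, img, caption) are merely assumed to mirror the SIL elements shown above.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class CaptionIndexer {
    public static void main(String[] args) throws Exception {
        // Open (or create) an on-disk index for the extracted captions.
        FSDirectory dir = FSDirectory.open(Paths.get("caption-index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // One Lucene document per image/caption pair taken from the SIL file.
            Document doc = new Document();
            doc.add(new StringField("pdf", "10094677.pdf", Field.Store.YES));
            doc.add(new StringField("img", "pics/10094677.Im4.jpg", Field.Store.YES));
            doc.add(new TextField("caption",
                    "FIG. 4. DNase I footprint analysis of ...", Field.Store.YES));
            writer.addDocument(doc);
        }
        // Searching would then go through Lucene's built-in query parser, e.g.
        //   new QueryParser("caption", new StandardAnalyzer()).parse("footprint")
    }
}

Swapping the StandardAnalyzer for a different analyzer is what lets the indexing vary in how captions are tokenized and what counts as a term.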