characters as suspicious if they do not seem to be English text. We started
with the language classifier Lingua::Identify by Jose Castro from CPAN, but
its results were not very stable: the library is not designed to identify text
that is not written in a natural language, and scientific papers contain many
words that do not look like English. We therefore implemented four simple
heuristics to detect conversion flaws. Each one marks either a font or a
contiguous piece of text as suspicious.
The first heuristic makes sure that the text in a font contains at least one
whitespace character. If there is not a single space in the text written in this
font, the font is marked as suspicious. As omitted spaces are a common conversion
problem, the second heuristic is similar: it searches for words with more than
20 letters, with the explicit exception of sequence data. The third marks possible
encoding errors, where a character without a proper encoding reference in the PDF
is converted into a letter-and-number code such as M150. To avoid matching special
names, this pattern must occur for at least two consecutive characters; single
characters, especially mathematical operators, therefore remain undetected.
Finally, the fourth heuristic marks all characters outside the 7-bit ASCII alphabet.
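The heuristics amount to little more than a few character tests per font or text run. The following Java sketch illustrates one way they could be written; the class and method names, the exact regular expressions (e.g. for the M150-style codes), and the omitted special handling of sequence data are our own assumptions, not the authors' implementation.

import java.util.regex.Pattern;

/** Illustrative sketch of the four conversion-flaw heuristics (names are assumed). */
public class ConversionFlawHeuristics {

    // Heuristic 2: words with more than 20 letters (sequence data would have to
    // be excluded beforehand; this sketch does not do that).
    private static final Pattern OVERLONG_WORD = Pattern.compile("\\p{L}{21,}");

    // Heuristic 3: letter-and-number codes such as "M150M151"; at least two
    // consecutive occurrences are required, so single operators are ignored.
    private static final Pattern ENCODING_CODES = Pattern.compile("(?:[A-Z]\\d{2,}){2,}");

    /** Heuristic 1: a font whose entire text contains no whitespace is suspicious. */
    static boolean lacksWhitespace(String fontText) {
        return fontText.chars().noneMatch(Character::isWhitespace);
    }

    /** Heuristic 2: suspicious if any word has more than 20 letters. */
    static boolean hasOverlongWord(String text) {
        return OVERLONG_WORD.matcher(text).find();
    }

    /** Heuristic 3: suspicious if unencoded characters appear as letter+number codes. */
    static boolean hasEncodingCodes(String text) {
        return ENCODING_CODES.matcher(text).find();
    }

    /** Heuristic 4: suspicious if any character lies outside the 7-bit ASCII range. */
    static boolean hasNonAscii(String text) {
        return text.chars().anyMatch(c -> c > 127);
    }
}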
The data cleansing step is quite fast and can process more than a document
per second. In a fairly recent test data set, 19 out of 605 documents (3.1%) were
filtered out: 10 of the documents did not contain enough text, 8 failed the quota,
and one included an improper symbol. All of these papers were unintelligible
from a text-extraction point of view. In older documents this rate can be higher,
as older converters tended to emphasize brevity over legibility.
1.3.5 CaptionSearch: The Web Application
As a simple demonstration of how the layout information can be used, we built a
web-based search engine. It is called CaptionSearch, since technically we do not
search for the images or tables themselves but for the caption text beneath them.
After image and table extraction has been done, the results are written to
an XML file in the SIL language. Each publication is represented by one file
containing a pdf element (lines 2-5), which may contain many img elements
(like the one in lines 3-5); each img element holds the link to the picture as an
attribute (line 3) and the caption as a nested element (lines 4-5).
1 <?xml version="1.0" encoding="iso-8859-2"?>
2 <pdf src="10094677.pdf">
3 <img src="pics/10094677.Im4.jpg">
4 <caption>FIG. 4. DNase I footprint analysis of ...
5 </caption></img></pdf>
Alternatively, there may be a text to show if the image is not available. That
way we can also use SIL as a regular search engine. For the indexing itself, we
use the Lucene package [21], which offers fast, Java-based indexing as well as
additional functionality such as a built-in query parser and several so-called
analyzers that allow us to vary how exactly the captions are indexed and what
defines a term.
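The indexing step could look roughly like the following sketch. It uses a recent Lucene API, so the exact classes may differ from the version cited in [21], and the field names (pdf, img, caption) are merely assumed to mirror the SIL elements shown above.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class CaptionIndexer {
    public static void main(String[] args) throws Exception {
        // Open (or create) an on-disk index for the extracted captions.
        FSDirectory dir = FSDirectory.open(Paths.get("caption-index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // One Lucene document per image/caption pair taken from the SIL file.
            Document doc = new Document();
            doc.add(new StringField("pdf", "10094677.pdf", Field.Store.YES));
            doc.add(new StringField("img", "pics/10094677.Im4.jpg", Field.Store.YES));
            doc.add(new TextField("caption",
                    "FIG. 4. DNase I footprint analysis of ...", Field.Store.YES));
            writer.addDocument(doc);
        }
        // Searching would then go through Lucene's built-in query parser, e.g.
        //   new QueryParser("caption", new StandardAnalyzer()).parse("footprint")
    }
}

Swapping the StandardAnalyzer for a different analyzer is what lets the indexing vary in how captions are tokenized and what counts as a term.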