Information Technology Reference
In-Depth Information
Fig. 1.2. UML model of the ELL format
be checked against it. The various implemented algorithms also use classes in a
similar structure.
The model has to fulfil multiple functions at once. First, it represents all
the meta-information we gain from the text conversion. Also, there is room for
further structuring. Algorithms can nest the text in a recursive data structure
and can include additional meta-information on the objects.
The XML document structure is HTML-like so the results can be easily
verified with a web browser. The original plain text conversion is preserved,
so by simply deleting all tags from the document, the text can be restored at all
steps. All metadata is stored in the attributes.
1.3.4
Data Cleansing
The conversion process from PDF to XML documents is not fool-proof. Although
the PDF specification is publicly available, many PDFs do not adhere to the
recommendations given there, but instead rely on a visual correct appearance
in the viewers. To cope with wrong input data, we add a data cleansing step to
filter out faulty documents.
The most common errors in PDF conversion are caused by wrong encoding
information for a specific font. Usually, each font provides a dictionary to match
each character in a font with the Unicode character it represents. Should this
be missing, we try to guess by assuming the encoding is similar to ASCII. To
catch wrong interpretation though, all of our tests inspect the document text by
splitting it into the fonts used. Each text from each font in each document must
pass three tests or is removed.
First, it is checked for non-ASCII symbols. These occur with wrong font in-
formation and are critical, as they disable the document for XML validation
later. The second test removes files which do not contain at least 1000 bytes of
document text. This case occurs in documents which are only a series of scanned
images without text information or documents which are just an abstract or a
form. Most scientific papers contain at least 20 kb of text.
The third test is a cumulative quota of suspicious characters over all characters
in the text. If this quota is above 10 percent, the document is rejected. We mark
Search WWH ::




Custom Search