Using Layout Data for the Analysis of Scientific Literature - Mining Complex Data - page 6

Information Technology Reference

In-Depth Information

Fig. 1.2. UML model of the ELL format

be checked against it. The various implemented algorithms also use classes in a

similar structure.

The model has to fulfil multiple functions at once. First, it represents all

the meta-information we gain from the text conversion. Also, there is room for

further structuring. Algorithms can nest the text in a recursive data structure

and can include additional meta-information on the objects.

The XML document structure is HTML-like so the results can be easily

verified with a web browser. The original plain text conversion is preserved,

so by simply deleting all tags from the document, the text can be restored at all

steps. All metadata is stored in the attributes.

1.3.4

Data Cleansing

The conversion process from PDF to XML documents is not fool-proof. Although

the PDF specification is publicly available, many PDFs do not adhere to the

recommendations given there, but instead rely on a visual correct appearance

in the viewers. To cope with wrong input data, we add a data cleansing step to

filter out faulty documents.

The most common errors in PDF conversion are caused by wrong encoding

information for a specific font. Usually, each font provides a dictionary to match

each character in a font with the Unicode character it represents. Should this

be missing, we try to guess by assuming the encoding is similar to ASCII. To

catch wrong interpretation though, all of our tests inspect the document text by

splitting it into the fonts used. Each text from each font in each document must

pass three tests or is removed.

First, it is checked for non-ASCII symbols. These occur with wrong font in-

formation and are critical, as they disable the document for XML validation

later. The second test removes files which do not contain at least 1000 bytes of

document text. This case occurs in documents which are only a series of scanned

images without text information or documents which are just an abstract or a

form. Most scientific papers contain at least 20 kb of text.

The third test is a cumulative quota of suspicious characters over all characters

in the text. If this quota is above 10 percent, the document is rejected. We mark

Next Page

Mining Complex Data

Search WWH ::

Custom Search

Home