Using Layout Data for the Analysis of Scientific Literature - Mining Complex Data


\[
\mathrm{ovl}(w_1, w_2) := (w_1.x_1 < w_2.x_2) \land (w_1.x_2 > w_2.x_1) \land (w_1.y_1 + w_1.\mathit{height} \geq w_2.y_2) \land (w_1.y_2 - w_1.\mathit{height} \leq w_2.y_1)
\]

Two words w_1 and w_2 overlap if the left side of w_1 is left of the right side of w_2 and vice versa. In addition, they have to be in vertical proximity to each other. x_1, x_2, y_1 and y_2 describe the bounding box of a word by two points: (x_1, y_1) is the lower left corner and (x_2, y_2) the upper right corner. The zero-point is in the upper left corner of the page.
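
The overlap predicate can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `Word` class and the sample boxes are hypothetical, and the sketch assumes that y grows downwards (so y_1 > y_2) and that a word's height equals y_1 - y_2, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Bounding box of a word; the origin is the upper left corner of the page."""
    x1: float  # left edge
    y1: float  # lower edge (larger y, assuming y grows downwards)
    x2: float  # right edge
    y2: float  # upper edge

    @property
    def height(self) -> float:
        # Assumption: the height is the vertical extent of the bounding box.
        return self.y1 - self.y2


def ovl(w1: Word, w2: Word) -> bool:
    """Literal transcription of the overlap predicate given in the text."""
    return (
        w1.x1 < w2.x2
        and w1.x2 > w2.x1
        and w1.y1 + w1.height >= w2.y2
        and w1.y2 - w1.height <= w2.y1
    )


# Hypothetical sample boxes, chosen only to exercise each condition.
above = Word(x1=10, y1=110, x2=50, y2=100)
below = Word(x1=30, y1=125, x2=70, y2=115)           # next line, shifted right
other_column = Word(x1=200, y1=125, x2=240, y2=115)  # next line, far to the right
far_below = Word(x1=30, y1=210, x2=70, y2=200)       # same x range, many lines down

print(ovl(above, below))         # True
print(ovl(above, other_column))  # False
print(ovl(above, far_below))     # False
```

Note that the predicate only uses the height of w_1, so it is not guaranteed to be symmetric when the two words differ in height.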

As most of the words in running text overlap with each other, ordinary text should be recognised as one unit. Table columns, in contrast, do not overlap with each other and should thus be recognisable as separate units (cf. Fig. 1.5). A simple distinction based on the typical number of neighbours in a unit allows a broad classification into text and table units.

Fig. 1.5. Left: result of the overlapping algorithm on a text unit. Right: result of the overlapping algorithm on a table unit.
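
The grouping into units and the broad text/table classification might look roughly like the sketch below. Several choices here are assumptions rather than details from the text: units are taken to be connected components of the overlap relation, the relation is treated as symmetric, the "typical" number of neighbours is taken to be the mean, and the threshold `max_table_neighbours=2` is a hypothetical value. The sample layout is invented for illustration.

```python
from collections import deque
from statistics import mean
from typing import NamedTuple


class Word(NamedTuple):
    x1: float  # left
    y1: float  # bottom (y grows downwards)
    x2: float  # right
    y2: float  # top


def ovl(a: Word, b: Word) -> bool:
    h = a.y1 - a.y2  # assumed definition of the word height
    return a.x1 < b.x2 and a.x2 > b.x1 and a.y1 + h >= b.y2 and a.y2 - h <= b.y1


def neighbours(words):
    """Map each word index to the indices of the words it overlaps with."""
    adj = {i: set() for i in range(len(words))}
    for i, a in enumerate(words):
        for j, b in enumerate(words):
            if i != j and (ovl(a, b) or ovl(b, a)):  # treated as symmetric
                adj[i].add(j)
    return adj


def units(words):
    """Group words into units: connected components of the overlap graph."""
    adj, seen, result = neighbours(words), set(), []
    for start in range(len(words)):
        if start in seen:
            continue
        seen.add(start)
        queue, unit = deque([start]), []
        while queue:
            i = queue.popleft()
            unit.append(i)
            for j in adj[i] - seen:
                seen.add(j)
                queue.append(j)
        result.append(sorted(unit))
    return result, adj


def classify(unit, adj, max_table_neighbours=2):
    """Hypothetical rule: few neighbours per word suggests a table column."""
    typical = mean([len(adj[i]) for i in unit])
    return "table" if typical <= max_table_neighbours else "text"


def row(spans, r):
    """Build the words of line r (line height 10, line pitch 12)."""
    return [Word(x1=a, y1=110 + 12 * r, x2=b, y2=100 + 12 * r) for a, b in spans]


# Invented layout: three staggered lines of text and one three-cell column.
A = [(0, 38), (42, 78), (82, 118), (122, 158)]
B = [(0, 18), (22, 58), (62, 98), (102, 138), (142, 158)]
words = row(A, 0) + row(B, 1) + row(A, 2)
words += [w for r in range(3) for w in row([(300, 340)], r)]

found, adj = units(words)
labels = [classify(u, adj) for u in found]
print(labels)  # ['text', 'table']
```

In this toy layout the words of the text block average about 2.5 neighbours each, while the stacked column cells average about 1.3, which is what the threshold separates. The pairwise comparison is quadratic in the number of words; a real implementation would probably use a spatial index.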

The starting point for identifying the structure of a table is its columns. These have to be aligned both vertically and horizontally. As Fig. 1.6 shows, however, the words do not line up precisely: a margin is not a single point but rather a margin area with two border points. Since tables may come with a variety of special cases, for example merged columns or two rows of text in a cell, the matching does not have to be exact. Once a table is established, units in its proximity are tested to see whether they fit the pattern. That way, solitary words are reintroduced when they fit.
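
One way to picture a margin area with two border points is the following sketch. It is purely illustrative: the `MarginArea` class, the `left_margin` helper, the `tolerance` parameter and the coordinates are all hypothetical and not taken from the text, which does not specify how the areas are computed or how loosely candidates are matched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarginArea:
    """A margin described by two border points instead of a single position."""
    lo: float
    hi: float

    def fits(self, x: float, tolerance: float = 0.0) -> bool:
        # A coordinate fits if it lies inside the (optionally widened) area.
        return self.lo - tolerance <= x <= self.hi + tolerance


def left_margin(left_edges):
    """Margin area spanned by the left edges of the words in one column."""
    return MarginArea(min(left_edges), max(left_edges))


column_left_edges = [100.0, 101.5, 99.2, 100.4]  # illustrative values only
margin = left_margin(column_left_edges)

# Testing a solitary word near an established column:
print(margin.fits(102.0, tolerance=1.0))  # True: inside the widened area
print(margin.fits(140.0, tolerance=1.0))  # False: clearly outside
```

A word whose left edge falls inside the widened area would be a candidate for reintroduction into the column; the same idea could be applied to right margins and to row baselines.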

We pre-evaluated our method on 86 scientific documents that included 92 tables. For the purpose of annotation it is most important to reach a high recall, as missed hits are much harder to find than wrong hits are to sort out. Adjusting the parameters to that goal, we were able to reach a very high recall of 91%, although many of the tables were only found partially. Unfortunately, this lowered the precision to as little as 44%. Many of the wrong tables were formulae with matrices or multiple lines. Another source of wrong tables was that tables were split in two, either because they stretched over two pages or because they had a vertical gap inside, which occurs when the column heading is very long while the values are quite short. We counted these as both a partial find and a wrong table. See also Table 1.3 for a summary of the results.
