\[
\begin{aligned}
\mathrm{ovl}(w_1, w_2) := {} & (w_1.x_1 < w_2.x_2) \\
& \land\ (w_1.x_2 > w_2.x_1) \\
& \land\ (w_1.y_1 + w_1.\mathit{height} \ge w_2.y_2) \\
& \land\ (w_1.y_2 - w_1.\mathit{height} \le w_2.y_1)
\end{aligned}
\]
Two words $w_1$ and $w_2$ overlap if the left side of $w_1$ lies to the left of the
right side of $w_2$ and the right side of $w_1$ lies to the right of the left side of
$w_2$. In addition, they have to be in vertical proximity to each other. $x_1$, $x_2$,
$y_1$ and $y_2$ describe the bounding box of a word by two points: $(x_1, y_1)$ is the
lower left corner and $(x_2, y_2)$ the upper right corner. The zero-point is in the
upper left corner of the page, so $y$ grows downwards and $y_1 \ge y_2$.
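The overlap test can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Word` class is hypothetical, the height is assumed to be `y1 - y2`, and the two vertical comparisons are read as "greater or equal" and "less or equal".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical word record; the origin is the upper left page corner."""
    text: str
    x1: float  # left edge
    y1: float  # lower edge (larger y, since y grows downwards)
    x2: float  # right edge
    y2: float  # upper edge

    @property
    def height(self) -> float:
        # Assumption: the height is derived from the bounding box.
        return self.y1 - self.y2


def ovl(w1: Word, w2: Word) -> bool:
    """Return True if w1 and w2 overlap horizontally and are vertically close."""
    horizontal = w1.x1 < w2.x2 and w1.x2 > w2.x1
    # Vertical proximity: w2 lies within one height of w1 above or below it.
    vertical = (w1.y1 + w1.height >= w2.y2) and (w1.y2 - w1.height <= w2.y1)
    return horizontal and vertical
```

Note that the test uses only the height of the first word, so `ovl(a, b)` and `ovl(b, a)` can differ when the two words have different heights.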
Since most words in running text overlap with one another, ordinary text should
be recognised as one unit. Table columns, in contrast, do not overlap with each
other and should thus be recognisable as separate units (cf. Fig. 1.5). A simple
distinction based on the typical number of neighbours in a unit allows a broad
classification into text and table units.
Fig. 1.5. Left: Result of the overlapping algorithm on a text unit; Right: Result of the
overlapping algorithm on a table unit
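One way to sketch the grouping into units and the neighbour-based distinction is to treat overlap as an edge in a graph and take connected components as units. The function names, the use of connected components, the threshold of 2.0 and the direction of the rule (more neighbours means text) are assumptions made for illustration; the text does not specify them.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple


def build_units(
    n: int, overlaps: Callable[[int, int], bool]
) -> Tuple[List[List[int]], Dict[int, Set[int]]]:
    """Group word indices 0..n-1 into units of mutually reachable words."""
    neighbours: Dict[int, Set[int]] = defaultdict(set)
    for i in range(n):
        for j in range(i + 1, n):
            # The overlap test need not be symmetric, so check both directions.
            if overlaps(i, j) or overlaps(j, i):
                neighbours[i].add(j)
                neighbours[j].add(i)

    seen: Set[int] = set()
    units: List[List[int]] = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, unit = [start], []
        while stack:  # depth-first walk over one connected component
            k = stack.pop()
            unit.append(k)
            for m in neighbours[k]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        units.append(sorted(unit))
    return units, neighbours


def classify_unit(
    unit: List[int], neighbours: Dict[int, Set[int]], threshold: float = 2.0
) -> str:
    """Label a unit by its average neighbour count (assumed rule and threshold)."""
    average = sum(len(neighbours.get(i, ())) for i in unit) / len(unit)
    return "text" if average > threshold else "table"
```

The intuition behind the assumed rule: a cell in a table column typically overlaps only the cell above and the cell below, whereas a word in running text often overlaps several staggered words on the adjacent lines.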
The starting point for identifying the structure of the tables is the columns.
These have to be aligned both vertically and horizontally. However, as Fig. 1.6
shows, the words do not line up precisely. A margin is therefore not a single
point but a margin area delimited by two border points. Since tables come with a
variety of special cases, for example merged columns or two rows of text in a
cell, the matching does not have to be exact. Once a table is established, units
in its proximity are tested to see whether they fit the pattern. In this way,
solitary words are reintroduced when they fit.
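The idea of a margin area with two border points can be illustrated with a small sketch. It assumes that the area is spanned by the smallest and largest left edge seen in a column and that a nearby word fits if its left edge falls inside that area, widened by a tolerance. `MarginArea`, `reintroduce` and the tolerance value are hypothetical and not taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MarginArea:
    """A column margin as an interval between two border points."""
    lo: float
    hi: float

    @classmethod
    def from_edges(cls, edges: Iterable[float]) -> "MarginArea":
        values = list(edges)
        return cls(min(values), max(values))

    def fits(self, x: float, tolerance: float = 0.0) -> bool:
        # Inexact matching: accept edges slightly outside the observed area.
        return self.lo - tolerance <= x <= self.hi + tolerance


def reintroduce(
    column_left_edges: Iterable[float],
    candidate_left_edges: Iterable[float],
    tolerance: float = 2.0,  # assumed value, in page units
) -> List[float]:
    """Return the candidate edges that fit the column's margin area."""
    area = MarginArea.from_edges(column_left_edges)
    return [x for x in candidate_left_edges if area.fits(x, tolerance)]
```

A real implementation would presumably check the right margin and the row pattern as well; the sketch only covers a single margin.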
We pre-evaluated our method on 86 scientific documents that included 92
tables. For the purpose of annotation it is most important to reach a high recall,
as finding missed tables is much harder than sorting out wrong hits. Adjusting
the parameters to that goal, we were able to reach a very high recall of 91%,
although many of the tables were only found partially. Unfortunately, this lowered
the precision to as little as 44%. Many of the wrong tables were formulae with
matrices or multiple lines. Another source of wrong tables was tables that were
split in two, either because they stretched over two pages or because they had
a vertical gap inside, which occurs when the column heading is very large while
the values are quite short. We counted each of these as both a partial find and a
wrong table. See also Table 1.3 for a summary of the results.
 