semantic information prevails, like text. The disadvantage, though, is that layout
identification becomes more complicated, since more objects are involved.
By convention, the parameters of a command are written before the command,
and the commands are abbreviated to short codes, two characters each in this
example. All text-related commands appear between a BT (line 1) and an ET
(line 8), which stand for Begin Text and End Text, respectively. The Tj and TJ
commands contain the text actually written. The other commands describe where
and how the text is written.
1 BT
2 8 0 0 8 52 757.35 Tm
3 /F2 1 Tf
4 0 -1.706 TD
5 (page 354) Tj
6 T*
7 [ (J) -27 (OURN) 27 (AL) -378 (1) ] TJ
8 ET
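The postfix convention described above can be illustrated with a small Python sketch. It is not the authors' extractor: the function names, the token pattern and the `gap_threshold` value are assumptions made for illustration, the listing is re-typed without its line numbers, and line 2 is assumed to read `8 0 0 8 52 757.35 Tm`. The sketch collects operands until an operator arrives, then reads the text from Tj and TJ, treating a strongly negative TJ adjustment as a word gap.

```python
import re

# Simplified token pattern (assumption): no nested parentheses, hex strings or comments.
TOKEN = re.compile(
    r"(?P<string>\((?:\\.|[^\\()])*\))"        # literal string such as (page 354)
    r"|(?P<open>\[)|(?P<close>\])"             # array delimiters used by TJ
    r"|(?P<name>/[^\s\[\]()/]+)"               # name such as /F2
    r"|(?P<number>[-+]?(?:\d+\.?\d*|\.\d+))"   # integer or real number
    r"|(?P<op>[A-Za-z*]+)"                     # operator such as BT, Tm, T*, TJ
)

def parse_commands(stream):
    """Return (operator, operands) pairs; operands precede their operator."""
    commands = []   # finished commands
    operands = []   # operands waiting for the next operator
    array = None    # list being filled while inside [ ... ]
    for m in TOKEN.finditer(stream):
        kind, text = m.lastgroup, m.group()
        if kind == "open":
            array = []
        elif kind == "close":
            operands.append(array)
            array = None
        elif kind == "op":
            commands.append((text, operands))
            operands = []
        else:
            if kind == "string":
                value = text[1:-1]        # drop the surrounding parentheses
            elif kind == "number":
                value = float(text)
            else:
                value = text              # names are kept verbatim
            (array if array is not None else operands).append(value)
    return commands

def extract_text(commands, gap_threshold=200):
    """Join Tj/TJ strings; gap_threshold (1/1000 text units) is a guess."""
    pieces = []
    for op, args in commands:
        if op == "Tj":
            pieces.append(args[0])
        elif op == "TJ":
            for item in args[0]:
                if isinstance(item, str):
                    pieces.append(item)
                elif item < -gap_threshold:   # large negative shift widens the gap
                    pieces.append(" ")
        elif op == "T*":                      # move to the start of the next line
            pieces.append("\n")
    return "".join(pieces)

STREAM = """BT
8 0 0 8 52 757.35 Tm
/F2 1 Tf
0 -1.706 TD
(page 354) Tj
T*
[ (J) -27 (OURN) 27 (AL) -378 (1) ] TJ
ET"""

commands = parse_commands(STREAM)
text = extract_text(commands)
```

Here the small adjustments -27 and 27 are read as kerning, while -378 is read as a word gap. A threshold on TJ adjustments alone is only a shortcut; it does not cover text that is positioned with explicit coordinates.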
For a correct layout analysis and indexing, it is vital to identify the correct word
boundaries; otherwise words may be glued together or torn apart. Also, without
spaces the text is indistinguishable from a large table. Unfortunately, the spaces
are at times not given directly; instead, the characters are just placed a little
farther apart than usual. The problem is aggravated by the fact that, in theory,
all characters can be written in any order by jumping around with explicitly
set coordinates.
In order to identify the spaces anyway, our first run through the text stream
just extracts the characters one by one and calculates their bounding boxes.
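Then the difference vector x_diff between two adjacent characters is calculated
and rotated into the writing direction with the rotation matrix R:

\[
R = \begin{pmatrix}
x_{\mathrm{old,right}} - x_{\mathrm{old,left}} & y_{\mathrm{old,right}} - y_{\mathrm{old,left}} \\
-y_{\mathrm{old,right}} + y_{\mathrm{old,left}} & x_{\mathrm{old,right}} - x_{\mathrm{old,left}}
\end{pmatrix}
\]

\[
\mathbf{x}_{\mathrm{diff}} = \frac{(\mathbf{x}_{\mathrm{new,left}} - \mathbf{x}_{\mathrm{old,right}})\, R}{|R|}
\]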
The resulting vector is compared to the current modified font size to determine
whether the gap is a space, no space, a carriage return, or a new block of text.
Next, the blocks are sorted and go through a similar procedure. This way the
initial information about the order is preserved best.
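The gap classification can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the function names and all threshold factors (`space_factor`, `line_factor`, `block_factor`) are hypothetical, since the text does not give the actual values. The writing direction is taken from the left and right edge of the previous character, and the gap is projected onto the unit vector along that direction and onto its normal, which is one way to express the difference vector in a frame aligned with the writing direction.

```python
import math

def rotate_into_writing_direction(old_left, old_right, new_left):
    """Return the gap (along, across) relative to the previous character's baseline."""
    dx = old_right[0] - old_left[0]
    dy = old_right[1] - old_left[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("previous character has zero width")
    ux, uy = dx / length, dy / length      # unit vector in writing direction
    gx = new_left[0] - old_right[0]        # difference vector between the
    gy = new_left[1] - old_right[1]        # old right edge and the new left edge
    along = gx * ux + gy * uy              # advance in writing direction
    across = -gx * uy + gy * ux            # offset perpendicular to it
    return along, across

def classify_gap(along, across, font_size,
                 space_factor=0.25, line_factor=0.5, block_factor=2.0):
    """Classify a gap; all factors are illustrative guesses, not source values."""
    if abs(across) > block_factor * font_size:
        return "new block"
    if abs(across) > line_factor * font_size:
        return "carriage return"
    if along > block_factor * font_size:
        return "new block"
    if along > space_factor * font_size:
        return "space"
    return "no space"

# Horizontal text, font size 10: a gap of 4 units is read as a space.
gap = rotate_into_writing_direction((0, 0), (5, 0), (9, 0))
label = classify_gap(*gap, font_size=10)

# Text rotated by 90 degrees gives the same result for the same gap.
rotated_gap = rotate_into_writing_direction((0, 0), (0, 5), (0, 9))
rotated_label = classify_gap(*rotated_gap, font_size=10)
```

Because the gap is measured relative to the writing direction, the same thresholds apply to horizontal and rotated text.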
The words' bounding boxes are extracted, as are all changes in font or font
size. Additional problems arise from text overlaps, e.g. when a special font is
used to write the accent of an accented “a” so that it overlaps the original “a”,
and from the overall handling of non-identifiable fonts and fonts that give wrong
bounding boxes. The results of these calculations are stored together with the
text in the ELL file format.
1.3.3 Structure in ELL
To get a better overview, we modelled the data structure in UML (cf. Fig. 2).
The UML version is then transferred to XML Schema, so all the XML files can