Information Technology Reference
In-Depth Information
Algorithm 1 General document separation
Require: An ordered list of pages (a batch), consisting of pages p 1 to p n .
Require: An FST rules that describes the possible sequences of documents, as in
Figure 8.7.
Ensure: An ordered list of documents d k , each of which consists of an ordered list
of pages.
{ Perform Classification }
2: for all pages p i , 1 ≤ i ≤ n in the batch do
3: for all classes c j do { Three classes per document type (start/middle/end) }
4: c ij probability that p i is of class c j
5: end for
6: end for
{ Perform Separation }
8: pg ← The FST representing the classification results, as in Figure 8.6
9: sg ← pg composed with rules
10: sg ← the best path through sg
{ Create documents }
12: D ←∅
13: d ←∅
14: for all Transitions in sg , in topological order do
15: if The transition is labeled with “ End” then
16: D += d
17: else if The transition is labeled with a page then
18: d += p
19: end if
20: end for
The goal is to find a path through the composed FST from the start state to an
end state under the constraint that we are interested in the highest possible overall
probability. Any graph search method can be applied. However, the topology of
the input graph simplifies the problem somewhat (see below). For convenience, we
represent the result of the search also as an FST. The separation algorithm can now
be described by Algorithm 1.
Representing rules about the sequence of document types as a graph has more
far-reaching applications than just to make sure that document boundaries are ob-
served. For instance, limits about the size of documents can now be easily introduced.
Figure 8.8 shows a rule FST that prescribes all Tax forms must be at least two pages
and at most five pages long.
p:p/1.0
TaxForm_End:
TaxForm_End/1.0
p:p/1.0
TaxForm_Start:
TaxForm_Start/1.0
p:p/1.0
End
p:p/1.0
p:p/1.0
Start
Fig. 8.8. A rule FST restricting tax forms to between two and five pages long
 
Search WWH ::




Custom Search