Information Technology Reference
In-Depth Information
Algorithm 1
General document separation
Require:
An ordered list of pages (a batch), consisting of pages
p
1
to
p
n
.
Require:
An FST
rules
that describes the possible sequences of documents, as in
Figure 8.7.
Ensure:
An ordered list of documents
d
k
, each of which consists of an ordered list
of pages.
{
Perform Classification
}
2:
for all
pages
p
i
,
1
≤ i ≤ n
in the batch
do
3:
for all
classes
c
j
do
{
Three classes per document type (start/middle/end)
}
4:
c
ij
←
probability that
p
i
is of class
c
j
5:
end for
6:
end for
{
Perform Separation
}
8:
pg ←
The FST representing the classification results, as in Figure 8.6
9:
sg ← pg
composed with
rules
10:
sg ←
the best path through
sg
{
Create documents
}
12:
D ←∅
13:
d ←∅
14:
for all
Transitions in
sg
, in topological order
do
15:
if
The transition is labeled with “ End”
then
16:
D
+=
d
17:
else if
The transition is labeled with a page
then
18:
d
+=
p
19:
end if
20:
end for
The goal is to find a path through the composed FST from the start state to an
end state under the constraint that we are interested in the highest possible overall
probability. Any graph search method can be applied. However, the topology of
the input graph simplifies the problem somewhat (see below). For convenience, we
represent the result of the search also as an FST. The separation algorithm can now
be described by Algorithm 1.
Representing rules about the sequence of document types as a graph has more
far-reaching applications than just to make sure that document boundaries are ob-
served. For instance, limits about the size of documents can now be easily introduced.
Figure 8.8 shows a rule FST that prescribes all Tax forms must be at least two pages
and at most five pages long.
p:p/1.0
TaxForm_End:
TaxForm_End/1.0
p:p/1.0
TaxForm_Start:
TaxForm_Start/1.0
p:p/1.0
End
p:p/1.0
p:p/1.0
Start
Fig. 8.8.
A rule FST restricting tax forms to between two and five pages long
Search WWH ::
Custom Search