Information Technology Reference
In-Depth Information
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Start
...
...
...
...
...
...
...
...
End
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Fig. 8.6. Classification results as an FST
that using this FST, it is still possible to generate invalid page sequences. For in-
stance, a start page for an Appraisal could be followed by another such start page. A
restriction transducer plays a role analogous to a language model and ensures that
such invalid paths are not present in the final search trellis.
A restriction (or rule) transducer contains labels on both the lower and upper
side. The transformation from page content to document type/page type symbols
has already been defined by the classification FST. The rule transducer has no infor-
mational role anymore, but a pure filtering role of eliminating impossible sequences
of document and page types. Figure 8.7 shows the rules for a sequence containing
documents of three document types. Note that all weights on the transitions are
given as 1.0, so as not to modify the probabilities from the page information.
p:p/1.0
TaxForm_End:TaxForm_End/1.0
TaxForm_Start:TaxForm_Start/1.0
p:p/1.0
Appraisal_End:Appraisal_End/1.0
Appraisal_Start:Appraisal_Start/1.0
Start
p:p/1.0
Note_End:Note_End/1.0
End
Note_Start:Note_Start/1.0
Fig. 8.7. FST for three document types
The composition operation on finite state transducers can now be used to gen-
erate a transducer that only contains such sequences of pages that result in a valid
sequence of documents. Composition treats the “output” of one transducer (the up-
per level) as the “input” (lower level) of the other transducer. If we compose the
classification FST with the rule FST, we achieve the desired result, an FST that
contains the probabilities for specific pages, but only contains valid sequences. The
output of this operation can be quite large, though. In the worst case, the composed
FST has a number of states that is the product of the number of states of both
arguments. In practice, this upper limit is not reached, but the size of the composed
FST is still a concern, which we will address below.
Search WWH ::




Custom Search