Automatic Document Separation: A Combination of Probabilistic Classification and Finite-State Sequence Modeling - Natural Language Processing and Text Mining


[Figure 8.5: transducer diagram. Labels recoverable from the figure: p:p/0.64, p:p/0.95, p:p/0.21, p:p/0.95, Taxform_Start, Taxform_End, Start, End]

Fig. 8.5. Classification results for one page

can be interpreted as a sequence of weighted finite state transducers (WFSTs) that are combined using the composition operation. We adopt this view by associating our trellis of page classification results with an acoustic model applied to some input in speech recognition. The probabilities for an individual page to be of some class correspond to the emission probabilities represented in the recognition trellis of a speech recognizer. The restriction that only complete documents may be part of a sequence of documents corresponds to the use of a language model that renders certain word sequences more likely than others.⁹ The “language model” we use currently contains only binary probability values, modeling hard constraints. However, similar to language models used in speech recognition, we could employ graded constraints represented by probabilities on language model transitions. This could be useful, for instance, in modeling the different likelihoods of sequences of documents, should such sequences exist.
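The interplay between per-page scores and the hard "language model" constraint can be illustrated with a small search over page labels. This is a minimal sketch, not the chapter's implementation: it replaces explicit WFST composition with a Viterbi-style dynamic program that should yield the same best path when the constraints are binary. The function names, the document type "Letter", and all scores are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; names, document types, and scores are hypothetical.
OPEN = ("start", "middle")    # roles that leave a document unfinished
CLOSED = ("end", "single")    # roles that complete a document


def allowed(prev, cur):
    """Binary 'language model': True if cur may follow prev."""
    prev_type, prev_role = prev
    cur_type, cur_role = cur
    if prev_role in OPEN:
        # The open document must continue with the same type.
        return cur_type == prev_type and cur_role in ("middle", "end")
    # The previous document is complete, so a new one must begin.
    return cur_role in ("start", "single")


def best_separation(page_scores):
    """page_scores: non-empty list with one dict per page, mapping
    (doc_type, role) -> probability. Returns (score, path) or None."""
    best = {label: (p, [label]) for label, p in page_scores[0].items()
            if label[1] in ("start", "single")}
    for scores in page_scores[1:]:
        step = {}
        for cur, p in scores.items():
            candidates = [(bp * p, path + [cur])
                          for prev, (bp, path) in best.items()
                          if allowed(prev, cur)]
            if candidates:
                step[cur] = max(candidates, key=lambda c: c[0])
        best = step
    # Only sequences that end with a complete document are valid.
    finals = [v for label, v in best.items() if label[1] in CLOSED]
    return max(finals, key=lambda c: c[0]) if finals else None


pages = [
    {("Taxform", "start"): 0.6, ("Taxform", "single"): 0.3, ("Letter", "single"): 0.1},
    {("Taxform", "middle"): 0.2, ("Taxform", "end"): 0.3, ("Letter", "start"): 0.5},
    {("Letter", "end"): 0.7, ("Taxform", "end"): 0.2, ("Letter", "single"): 0.1},
]
score, path = best_separation(pages)
```

In this example, picking the best label for each page independently would open a Taxform on the first page and never close it. The constrained search instead settles on a single-page Taxform followed by a two-page Letter. Graded constraints could be modeled by letting `allowed` return a probability that is multiplied into the path score.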

In order to apply this analogy, we need to define the topology and contents of two finite state transducers. For the document type/page type model, the classification results can be represented in an FST as shown in Figure 8.5. The transitions of a classification transducer are of two kinds:

• Transitions that represent physical pages contain a symbol indicating a physical page on the lower and upper level and a classification score as weight. Which score is attached to the page depends on the topology of the transducer, which is defined by the next type of transitions.

• Transitions with an empty lower level denote boundary information about documents. There are transitions for the start and the end of a document. The occurrence of these transitions thus defines the type of page and the type of score that should be used. For instance, in Figure 8.5, the topmost transition (with score 0.64) indicates a middle page, since there are no boundaries given. The second transition chain belongs to a form that contains only a single page and consequently is bounded by both a start indicator and an end indicator. The third and fourth transitions belong to start and end pages respectively.
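The two kinds of transitions can be made concrete with a small data structure. The following is a sketch under stated assumptions, not the authors' code: the `Arc` class, the `page_chains` function, the use of an empty string for the empty lower level, and the neutral weight 1.0 on boundary transitions are all hypothetical. Only the middle-page score 0.64 comes from the text; the other three scores are invented placeholders, since the text does not say which score belongs to which chain.

```python
from dataclasses import dataclass

EPS = ""  # assumed notation for an empty lower-level symbol


@dataclass(frozen=True)
class Arc:
    src: int
    dst: int
    lower: str
    upper: str
    weight: float


def page_chains(doc_type, scores, src=0, dst=1):
    """Build the four alternative chains (middle, single, start, end)
    between src and dst for one page and one document type."""
    arcs = []
    next_state = max(src, dst) + 1  # fresh ids for intermediate states
    for role in ("middle", "single", "start", "end"):
        labels = []
        if role in ("single", "start"):
            # Boundary transition: empty lower level, assumed weight 1.0.
            labels.append((EPS, f"{doc_type}_Start", 1.0))
        # Page transition: page symbol on both levels, score as weight.
        labels.append(("p", "p", scores[role]))
        if role in ("single", "end"):
            labels.append((EPS, f"{doc_type}_End", 1.0))
        cur = src
        for i, (lower, upper, weight) in enumerate(labels):
            if i == len(labels) - 1:
                nxt = dst
            else:
                nxt = next_state
                next_state += 1
            arcs.append(Arc(cur, nxt, lower, upper, weight))
            cur = nxt
    return arcs


# 0.64 is the middle-page score from the text; the rest are placeholders.
arcs = page_chains("Taxform", {"middle": 0.64, "single": 0.10,
                               "start": 0.20, "end": 0.30})
```

The boundary transitions carry no score of their own here, so the weight of each chain is simply the page score selected by the boundaries that surround it.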

Figure 8.5 contains the information necessary to represent the classification results for one page with regard to one document type. The complete FST representing a problem with three document types and four pages is shown in Figure 8.6. Note

⁹ On a more basic level, the document sequence restrictions can also be likened to the use of a pronunciation dictionary within a speech recognizer. However, acoustic modeling and pronunciation dictionary are usually combined into one processing step, while we explicitly distinguish between these.
