Fig. 8.3. A trellis for the model of Eq. 8.2. (Vertical axis: document-type states Appraisal, TaxForm, Note, ...; horizontal axis: time steps page 1 through page 4; each state is annotated with the probability that the page belongs to that type.)
ability and thus enables an effective mapping of SVM scores to probabilities. The
absolute value of the cost factor is an estimate of the optimal tradeoff between
memorization and generalization and thus enables an efficient handling of noisy
data. This, together with mapping the scores to probabilities, allows us to
effectively utilize Support Vector Machines with their superior learning paradigm
for the estimation of the sequence models.
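The score-to-probability mapping mentioned above is commonly realized with Platt scaling, i.e., fitting a sigmoid to the raw SVM decision values. A minimal sketch (the sigmoid parameters A and B are illustrative placeholders here; in practice they are fit by maximum likelihood on held-out calibration data):

```python
import math

def platt_probability(score, A=-1.7, B=0.0):
    """Map a raw SVM decision value to a probability via a fitted
    sigmoid (Platt scaling). A and B are illustrative values; they
    would normally be estimated on held-out calibration data."""
    return 1.0 / (1.0 + math.exp(A * score + B))

# Scores further on the positive side of the margin map to
# probabilities closer to 1; the margin itself maps to 0.5.
for s in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"score {s:+.1f} -> P = {platt_probability(s):.3f}")
```

Because the outputs are proper probabilities, scores from independently trained per-category SVMs become comparable and can be multiplied along a sequence, which is exactly what the sequence models below require.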
8.4.3 Sequence Processing
In the previous two sections, we outlined the different probability models that can
be applied to the problem of document separation and the approach to classification
that we have taken to arrive at probabilities for categories attached to pages. All
probability models were based on viewing the classification and separation process
as a sequence mapping problem, described formally as a Markov chain as in Eq. 8.1.
Experience from information extraction and speech recognition (e.g., [15]) shows
that the results of such mappings and the search for the best sequence can be
represented as a trellis. A trellis is a two-dimensional graph in which the horizontal
axis represents time steps. For speech recognition, this would be incoming acoustic
feature vectors. For information extraction, it could be words and, in our case, each
time step is an incoming page within a batch. The vertical axis represents the states
in which the mapping process may find itself and also the possible output symbols
it may generate.
Transitions from states for one time step to the next denote the larger structure
of the problem. These transitions can also be annotated with probabilities.
Consider the very simple model of Eq. 8.2, in which the probability of a document
type only depends on the content of a page. In the trellis, there are as many states
as there are document types. The interesting value for such a state is the probability
that the page is of the associated document type. There are transitions from each
state for one time step to all states for the next time step, indicating that each
following state is equally probable. Figure 8.3 shows part of such a trellis.
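For the simple model of Eq. 8.2, the trellis therefore reduces to a grid of per-page class probabilities with uniform transitions. A minimal sketch, using the illustrative values shown in Fig. 8.3:

```python
# Trellis for the simple model: one row per document type (state),
# one column per page (time step). Values are the calibrated
# per-page probabilities from Fig. 8.3; all transitions between
# consecutive time steps are equally likely.
doc_types = ["Appraisal", "TaxForm", "Note"]

trellis = {
    "Appraisal": [0.9, 0.5, 0.5, 0.8],
    "TaxForm":   [0.7, 0.9, 0.9, 0.8],
    "Note":      [0.5, 0.7, 0.7, 0.3],
}

# With uniform transitions, the best state at each time step is simply
# the type with the highest probability for that page (ties broken by
# listing order in doc_types).
best = [max(doc_types, key=lambda t: trellis[t][page]) for page in range(4)]
print(best)  # ['Appraisal', 'TaxForm', 'TaxForm', 'Appraisal']
```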
The question of what a “best sequence” is can easily be answered: Since the
individual scores delivered are probabilities (due to calibration), the probability of
the complete sequence can be modeled using the product of all scores encountered on
a path from the first page to the last. This sequence can be computed using Viterbi
Search.
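Viterbi search finds that maximum-product path by dynamic programming instead of enumerating all state sequences. A generic sketch in log space (the states, emission, and transition tables here are hypothetical, not the chapter's exact model):

```python
import math

def viterbi(states, emit, trans, n_steps):
    """Return the highest-probability state sequence of length n_steps.
    emit[s][t]  = P(observation at step t | state s), assumed > 0
    trans[p][s] = P(state s follows state p),          assumed > 0
    Uses log probabilities to avoid underflow on long sequences."""
    # delta[s] = best log-probability of any path ending in state s
    delta = {s: math.log(emit[s][0]) for s in states}
    back = []  # back[t][s] = predecessor of s on the best path to s
    for t in range(1, n_steps):
        new_delta, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: delta[p] + math.log(trans[p][s]))
            ptr[s] = prev
            new_delta[s] = (delta[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][t]))
        back.append(ptr)
        delta = new_delta
    # Backtrack from the best final state to recover the path.
    last = max(states, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Hypothetical two-state example with uniform transitions: the best
# path follows the per-step emission maxima.
states = ["A", "B"]
emit = {"A": [0.9, 0.2, 0.8], "B": [0.1, 0.8, 0.2]}
trans = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
print(viterbi(states, emit, trans, 3))  # ['A', 'B', 'A']
```

With uniform transitions this reduces to a per-step argmax, as in Fig. 8.3; non-uniform transition probabilities are what let the search prefer globally coherent page sequences over locally best labels.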