Fig. 8.3. A trellis for the model of Eq. 8.2. (Vertical axis: document-type states Appraisal, TaxForm, Note, ...; horizontal axis: time steps page 1 through page 4; each state is annotated with the probability that the page belongs to that type.)
ability and thus enables an effective mapping of SVM scores to probabilities. The
absolute value of the cost factor is an estimate of the optimal tradeoff between
memorization and generalization and thus enables an efficient handling of noisy
data. This, together with mapping the scores to probabilities, allows us to
effectively utilize Support Vector Machines with their superior learning paradigm
for the estimation of the sequence models.
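The score-to-probability mapping mentioned above is commonly realized with Platt scaling, i.e., fitting a sigmoid to the raw SVM decision values. A minimal sketch (the sigmoid parameters A and B are illustrative placeholders here; in practice they are fit by maximum likelihood on held-out calibration data):

```python
import math

def platt_probability(score, A=-1.7, B=0.0):
    """Map a raw SVM decision value to a probability via a fitted
    sigmoid (Platt scaling). A and B are illustrative values; they
    would normally be estimated on held-out calibration data."""
    return 1.0 / (1.0 + math.exp(A * score + B))

# Scores further on the positive side of the margin map to
# probabilities closer to 1; the margin itself maps to 0.5.
for s in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"score {s:+.1f} -> P = {platt_probability(s):.3f}")
```

Because the outputs are proper probabilities, scores from independently trained per-category SVMs become comparable and can be multiplied along a sequence, which is exactly what the sequence models below require.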
8.4.3 Sequence Processing
In the previous two sections, we outlined the different probability models that can
be applied to the problem of document separation and the approach to classification
that we have taken to arrive at probabilities for categories attached to pages. All
probability models were based on viewing the classification and separation process
as a sequence mapping problem, described formally as a Markov chain as in Eq. 8.1.
Experience from information extraction and speech recognition (e.g., [15]) shows
that the results of such mappings and the search for the best sequence can be
represented as a trellis. A trellis is a two-dimensional graph in which the horizontal
axis represents time steps. For speech recognition, this would be incoming acoustic
feature vectors. For information extraction, it could be words and, in our case, each
time step is an incoming page within a batch. The vertical axis represents the states
in which the mapping process may find itself and also the possible output symbols
it may generate.
Transitions from states for one time step to the next denote the larger structure
of the problem. These transitions can also be annotated with probabilities.
Consider the very simple model of Eq. 8.2, in which the probability of a document
type only depends on the content of a page. In the trellis, there are as many states
as there are document types. The interesting value for such a state is the probability
that the page is of the associated document type. There are transitions from each
state for one time step to all states for the next time step, indicating that each
following state is equally probable. Figure 8.3 shows part of such a trellis.
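For the simple model of Eq. 8.2, the trellis therefore reduces to a grid of per-page class probabilities with uniform transitions. A minimal sketch, using the illustrative values shown in Fig. 8.3:

```python
# Trellis for the simple model: one row per document type (state),
# one column per page (time step). Values are the calibrated
# per-page probabilities from Fig. 8.3; all transitions between
# consecutive time steps are equally likely.
doc_types = ["Appraisal", "TaxForm", "Note"]

trellis = {
    "Appraisal": [0.9, 0.5, 0.5, 0.8],
    "TaxForm":   [0.7, 0.9, 0.9, 0.8],
    "Note":      [0.5, 0.7, 0.7, 0.3],
}

# With uniform transitions, the best state at each time step is simply
# the type with the highest probability for that page (ties broken by
# listing order in doc_types).
best = [max(doc_types, key=lambda t: trellis[t][page]) for page in range(4)]
print(best)  # ['Appraisal', 'TaxForm', 'TaxForm', 'Appraisal']
```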
The question of what a “best sequence” is can easily be answered: Since the
individual scores delivered are probabilities (due to calibration), the probability of
the complete sequence can be modeled using the product of all scores encountered on
a path from the first page to the last. This sequence can be computed using Viterbi
Search.
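Viterbi search finds that maximum-product path by dynamic programming instead of enumerating all state sequences. A generic sketch in log space (the states, emission, and transition tables here are hypothetical, not the chapter's exact model):

```python
import math

def viterbi(states, emit, trans, n_steps):
    """Return the highest-probability state sequence of length n_steps.
    emit[s][t]  = P(observation at step t | state s), assumed > 0
    trans[p][s] = P(state s follows state p),          assumed > 0
    Uses log probabilities to avoid underflow on long sequences."""
    # delta[s] = best log-probability of any path ending in state s
    delta = {s: math.log(emit[s][0]) for s in states}
    back = []  # back[t][s] = predecessor of s on the best path to s
    for t in range(1, n_steps):
        new_delta, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: delta[p] + math.log(trans[p][s]))
            ptr[s] = prev
            new_delta[s] = (delta[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][t]))
        back.append(ptr)
        delta = new_delta
    # Backtrack from the best final state to recover the path.
    last = max(states, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Hypothetical two-state example with uniform transitions: the best
# path follows the per-step emission maxima.
states = ["A", "B"]
emit = {"A": [0.9, 0.2, 0.8], "B": [0.1, 0.8, 0.2]}
trans = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
print(viterbi(states, emit, trans, 3))  # ['A', 'B', 'A']
```

With uniform transitions this reduces to a per-step argmax, as in Fig. 8.3; non-uniform transition probabilities are what let the search prefer globally coherent page sequences over locally best labels.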