Similarly, the rule FST can be used to impose constraints on the order of doc-
uments, depending on the application. For instance, a mortgage application could
prescribe that an Appraisal is always directly followed by a Note. Using a powerful
representation mechanism such as finite state transducers significantly simplifies the
introduction of additional functionality.
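Such an ordering rule can be pictured as a small finite-state machine. The following sketch (an illustration, not the authors' implementation; the document-type names and the two-state encoding are assumptions) checks the example constraint that an Appraisal is always directly followed by a Note:

```python
# Hypothetical sketch: the rule "an Appraisal is always directly followed
# by a Note" encoded as a two-state finite-state acceptor.
# State 0 = normal, state 1 = just saw an Appraisal.

def accepts(doc_sequence):
    state = 0
    for doc in doc_sequence:
        if state == 1:
            if doc != "Note":
                return False  # rule violated: Appraisal not followed by Note
            state = 0
        elif doc == "Appraisal":
            state = 1
    return state == 0  # the batch must not end right after an Appraisal
```

For example, `accepts(["Note", "Appraisal", "Note"])` holds, while `accepts(["Appraisal", "W2"])` does not. In the full system such rules are transducers composed with the classification graph rather than standalone checks.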
However, there are also drawbacks to a naive implementation of operations over
finite state transducers. The additional representational power comes at a price,
mainly in terms of memory consumption and secondarily in processing time. A
typical mortgage application defines somewhere between 50 and 200 document types
and consequently between 150 and 600 classes for which probabilities and graphs
have to be produced. A batch may be as long as 1000 pages. The composition of
the resulting classification graph with the rule graph requires a great deal of space
to hold and time to construct. We observed graph sizes of over one gigabyte and
runtimes of tens of minutes, clearly insufficient for successful application in
industry.
This situation is again similar to speech recognition with language modeling.
A language model represents an extremely large number of possible and probable
word sequences. Incorporating this knowledge into the basic recognition trellis is
infeasible due to both space and time restrictions. Thus, both knowledge sources
are kept separate. The probabilities delivered by the language model are taken into
account by the search for the best word sequence. For our problem, we also notice
that the composition of the classification and rules FSTs is transient; in the end,
we are interested only in the best path through the composition FST. Thus, it
is unnecessary to completely unfold the composition. Instead, we use a technique
similar to delayed composition [17] combined with beam search [18] to extract the
final result. This reduces memory usage significantly for large problems. Table 8.4
shows the memory usage for a few data points, all of which are well within the
limits of our requirements. The runtime is also much lower; for 1000 pages and
200 categories, the processing time decreases from roughly 16 minutes for the naive
approach to less than 3 minutes for the advanced procedure.
Table 8.4. Memory usage for separation

Batch size (pages)   200 categories   300 categories
1000                 118 MB           151 MB
2000                 211 MB           363 MB
8.5 Results
We evaluated the performance of all probability models for sequences that we de-
scribed in Section 8.4.2. Table 8.5 shows the F1 values for all models. The table
suggests three main results: