Automatic Document Separation: A Combination of Probabilistic Classification and Finite-State Sequence Modeling - Natural Language Processing and Text Mining

the form and contents of every such form is a major undertaking. But even then,

the layout of the forms is known and can, in principle, be described in advance.

For other semi-structured forms, this is not the case. For instance, appraisals (as

in the case of mortgage loan applications) always contain roughly the same type

of information (property address, value, comparable objects in the vicinity of the

property in question, etc.). However, recognizing an appraisal based on very local

information about specific structural properties of the form is extremely difficult.

The layout of appraisals from different sources cannot be foreseen. As such, a

search for specific items on a page, used as an indication of which form is present

and, more importantly for the case discussed here, of the length of the document,

is highly uncertain.

This is even more pronounced for so-called unstructured forms which have no

specific layout considerations. Those also appear in concrete business cases, such

as legal documents, waivers, riders, etc. Here, a layout-based definition of forms is

highly unlikely to succeed.

The conclusion is that for a large variety of important documents, a rule-driven

layout-based recognition is possibly inferior to a content-based recognition, as is used

in the present solution. This is still true if a subsequent extraction step is used to

gather information from the documents. Distinguishing between the separation step

and the extraction step can facilitate the process of writing rules for information

extraction, since the identity of documents can now be taken for granted (see footnote 3).

The cost of maintaining a solution for separation (and extraction) also needs

to be considered, since it is highly likely that the layout of forms changes over

time. Except in very specific circumstances, the extent and form of the change are

outside the control of the maintainer of a separation solution. This entails monitoring

the incoming forms for such changes and modifying the rules governing recognition

immediately once a change is observed.

From the preceding discussion, it seems to us that treating separation and extraction

as two distinct steps is advantageous. Furthermore, we favor content-based

and example-based methods over manually written layout rules. The exact form of

features used (e.g., image-based or text-based) is unspecified in principle. However,

based on the experiences in our application domain, we prefer text-based features

(see below).

The most direct approach to document separation would treat the task as

a straightforward segmentation problem. Maximum Entropy (ME) methods have

proven very successful in the area of segmentation of natural language sentences

[2, 3]. Each boundary (in our case the point between two pages) is characterized by

features of its environment (e.g., by the words used on the preceding and following

page). An ME classifier is then used to solve the binary problem (boundary/non-boundary)

for new, unseen page transitions. We are unaware of any publication

using this approach for automatic document separation.
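
The segmentation view described above can be illustrated with a minimal sketch. It is not the implementation used in this chapter: it assumes whitespace-tokenized page text, uses the fact that a binary ME model has the same form as logistic regression, and trains on a handful of invented page transitions. The names transition_features, train_maxent, boundary_probability and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def transition_features(prev_page, next_page):
    """Describe a page transition by the words on either side of it."""
    feats = {"bias"}
    feats.update("prev:" + w for w in prev_page.lower().split())
    feats.update("next:" + w for w in next_page.lower().split())
    return feats


def predict_proba(weights, feats):
    """P(boundary | features) under a binary maximum entropy (logistic) model."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(examples, epochs=200, lr=0.5, l2=0.01):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood.

    examples: list of (feature_set, label), label 1 = boundary, 0 = non-boundary.
    The l2 term is a small penalty that keeps the weights bounded.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            p = predict_proba(weights, feats)
            for f in feats:
                weights[f] += lr * ((label - p) - l2 * weights[f])
    return weights


def boundary_probability(weights, prev_page, next_page):
    return predict_proba(weights, transition_features(prev_page, next_page))


# Invented toy transitions (assumption): (previous page, next page), label.
TRAIN = [
    (("sincerely yours john smith", "uniform residential appraisal report"), 1),
    (("property address and estimated value", "comparable sales in the vicinity"), 0),
    (("comparable sales in the vicinity", "waiver of liability agreement"), 1),
    (("the borrower hereby agrees", "signature of borrower and date"), 0),
]

examples = [(transition_features(p, n), y) for (p, n), y in TRAIN]
model = train_maxent(examples)

# Score a new, unseen page transition.
p_new = boundary_probability(
    model, "sincerely yours jane doe", "uniform residential appraisal report"
)
print("boundary" if p_new > 0.5 else "non-boundary")
```

A realistic system would need far more training transitions and a more careful feature set; the sketch only shows how each boundary is reduced to a binary decision over features of its environment.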

Instead of looking for boundaries, one could also attempt to ascertain that two

consecutive pages belong in the same document, thus indirectly establishing borders.
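
As a rough sketch of this indirect formulation, the snippet below places a border wherever a "same document" score for two consecutive pages falls below a threshold. The scoring function is an assumption made purely for illustration: word-overlap (Jaccard) similarity stands in for whatever trained model would actually judge whether two pages belong together, and the names jaccard, split_into_documents and the threshold value are hypothetical.

```python
def jaccard(page_a, page_b):
    """Word-overlap similarity; a stand-in (assumption) for a trained scorer."""
    a, b = set(page_a.lower().split()), set(page_b.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_into_documents(pages, same_document_score=jaccard, threshold=0.3):
    """Group page indices into documents.

    A page joins the current document when its score against the previous
    page reaches the threshold; otherwise a border is implied and a new
    document starts.
    """
    documents = []
    for index, page in enumerate(pages):
        if index > 0 and same_document_score(pages[index - 1], page) >= threshold:
            documents[-1].append(index)
        else:
            documents.append([index])
    return documents


# Invented toy page stream (assumption).
pages = [
    "appraisal report property address value",
    "appraisal report comparable property value",
    "waiver of liability signed by borrower",
    "waiver of liability borrower signature",
]
groups = split_into_documents(pages)
print(groups)
```

Borders are never predicted directly here; they fall out wherever two neighboring pages fail the same-document test.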

3 Note that this still assumes that rules are used to identify local information

on a page. It may also be possible to handle the extraction step in a content-

based manner, focusing not on the layout of a page, but on the words on it.

The respective merits of each of these methods are beyond the scope of the present

chapter.
