the form and contents of every such form is a major undertaking. But even then,
the layout of the forms is known and can, in principle, be described in advance.
For other semi-structured forms, this is not the case. For instance, appraisals (as
in the case of mortgage loan applications) always contain roughly the same type
of information (property address, value, comparable objects in the vicinity of the
property in question, etc.). However, recognizing an appraisal based on very local
information about specific structural properties of the form is extremely difficult.
The layout of appraisals from different sources cannot be foreseen. As such,
searching for specific items on a page, and using them as an indication of what
form is present and, more importantly for the case discussed here, of the length of
the document, is highly uncertain.
This is even more pronounced for so-called unstructured forms which have no
specific layout considerations. Those also appear in concrete business cases, such
as legal documents, waivers, riders, etc. Here, a layout-based definition of forms is
highly unlikely to succeed.
The conclusion is that for a large variety of important documents, a rule-driven
layout-based recognition is possibly inferior to a content-based recognition, as is used
in the present solution. This is still true if a subsequent extraction step is used to
gather information from the documents. Distinguishing between the separation step
and the extraction step can facilitate the process of writing rules for information
extraction, since the identity of documents can now be taken for granted. 3
The cost of maintaining a solution for separation (and extraction) also needs
to be considered, since it is highly likely that the layout of forms changes over
time. Except in very specific circumstances, the extent and form of the change are
outside the control of the maintainer of a separation solution. This entails monitoring
the incoming forms for such changes and modifying the rules governing recognition
as soon as a change is observed.
From the preceding discussion, it seems to us that treating separation and extraction
as two distinct steps is advantageous. Furthermore, we favor content-based
and example-based methods over manually written layout rules. The exact form of
features used (e.g., image-based or text-based) is unspecified in principle. However,
based on the experiences in our application domain, we prefer text-based features
(see below).
The most direct approach to document separation would treat the task as
a straightforward segmentation problem. Maximum Entropy (ME) methods have
proven very successful in the area of segmentation of natural language sentences
[2, 3]. Each boundary (in our case the point between two pages) is characterized by
features of its environment (e.g., by the words used on the preceding and following
page). An ME classifier is then used to solve the binary problem (boundary/non-boundary)
for new, unseen page transitions. We are unaware of any publication
using this approach for automatic document separation.
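The boundary classification described above can be sketched as follows. This is a minimal illustration, not the setup of any published system: it assumes a binary ME model (which coincides with logistic regression) trained by plain stochastic gradient ascent, and the feature names, toy pages, and training settings are all hypothetical.

```python
import math
from collections import defaultdict

def transition_features(prev_page, next_page):
    """Binary features of a page transition: the words on the preceding
    and on the following page (feature naming is an assumption)."""
    feats = {"bias": 1.0}
    for word in set(prev_page.lower().split()):
        feats["prev=" + word] = 1.0
    for word in set(next_page.lower().split()):
        feats["next=" + word] = 1.0
    return feats

def boundary_probability(weights, feats):
    """P(boundary | transition) under the binary ME (logistic) model."""
    z = sum(weights[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5, l2=0.01):
    """Stochastic gradient ascent on the L2-regularized log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            error = label - boundary_probability(weights, feats)
            for name, value in feats.items():
                weights[name] += lr * (error * value - l2 * weights[name])
    return weights

# Toy page stream (invented text); labels[i] is 1 if a new document
# starts between pages[i] and pages[i + 1].
pages = [
    "uniform residential appraisal report property address",
    "comparable sales in the vicinity appraised value",
    "mortgage loan application borrower name",
    "borrower employment income signature",
    "uniform residential appraisal report subject property",
    "comparable sales adjustments appraised value",
]
labels = [0, 1, 0, 1, 0]

examples = [
    (transition_features(pages[i], pages[i + 1]), labels[i])
    for i in range(len(labels))
]
model = train(examples)

# Score two new, unseen page transitions.
p_boundary = boundary_probability(
    model,
    transition_features("comparable sales appraised value",
                        "mortgage loan application"),
)
p_same = boundary_probability(
    model,
    transition_features("uniform residential appraisal report",
                        "comparable sales appraised value"),
)
```

A transition would be declared a boundary when its probability exceeds a chosen threshold (0.5 is the natural default). In practice the threshold and the regularization strength would be tuned on held-out page streams.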
Instead of looking for boundaries, one could also attempt to ascertain that two
consecutive pages belong in the same document, thus indirectly establishing borders.
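The sketch below illustrates how borders follow indirectly from pairwise "same document" decisions. The text does not say how such decisions would be made, so the word-overlap (Jaccard) test and its threshold are purely hypothetical stand-ins; any pairwise classifier could be passed in instead.

```python
def jaccard(page_a, page_b):
    """Word-set overlap between two pages, in [0, 1]."""
    words_a = set(page_a.lower().split())
    words_b = set(page_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def same_document(prev_page, next_page, threshold=0.2):
    """Hypothetical stand-in decision: enough shared vocabulary."""
    return jaccard(prev_page, next_page) >= threshold

def separate(pages, same_doc=same_document):
    """Group a page stream into documents; a border is placed wherever
    two consecutive pages are NOT judged to belong together."""
    documents = []
    for page in pages:
        if documents and same_doc(documents[-1][-1], page):
            documents[-1].append(page)
        else:
            documents.append([page])
    return documents

# Toy page stream (invented text).
stream = [
    "appraisal of property at 12 elm street",
    "appraisal value of property and comparable sales",
    "waiver of liability signed by the borrower",
]
documents = separate(stream)
```

With these toy pages, the first two pages share enough words to be grouped together, while the third starts a new document.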
3 Note that this still assumes that rules are used to identify local information
on a page. It may also be possible to handle the extraction step in a content-
based manner, focusing not on the layout of a page, but on the words on it.
The respective merits of these methods are beyond the scope of the present
chapter.