existed is re-created. It is challenging, since the person needs to have a fair amount
of knowledge of loan documents (hundreds of document categories) and work with
a high degree of attention to detail. Nevertheless, the error rate for this process can
be as high as 8%. The cost for the insertion is also significant, both in terms of labor
and material; it is estimated that 50% of the document preparation cost is used for
sorting and the insertion of separator sheets. One customer estimates the printing
cost for separator sheets alone to be in excess of $1M per year.
In the automated solution presented here [1], the loan files still need to be collected
and shipped to a central facility for processing (see footnote 2). At the facility, the batches
are scanned in their entirety, without inserting separator sheets beforehand. The
result of this process is a long sequence of images of pages, up to 2000 images per
batch. Next, the text on each page is read by an OCR engine. A classification engine
(see Section 8.4.2) determines the likely document types of loan documents (e.g.,
Appraisals, Tax Forms, etc.), and a separation mechanism (see Section 8.4.3) inserts
virtual boundaries between pages to indicate where one document ends and the next
one begins. The separated documents are then labeled accordingly and delivered for
further processing, e.g., the extraction of relevant information.
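The flow just described (OCR text per page, a document type per page, then virtual boundaries) can be illustrated with a minimal sketch. It is not the method of Sections 8.4.2 and 8.4.3: the keyword-based toy_classifier, the Document class, and the rule "start a new document whenever the predicted type changes" are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    doc_type: str
    page_indices: List[int] = field(default_factory=list)


def toy_classifier(page_text: str) -> str:
    """Hypothetical stand-in for the classification engine: keyword lookup only."""
    text = page_text.lower()
    if "appraisal" in text:
        return "Appraisal"
    if "tax" in text:
        return "Tax Form"
    return "Unknown"


def separate(page_texts: List[str],
             classify: Callable[[str], str] = toy_classifier) -> List[Document]:
    """Insert a virtual boundary wherever the predicted type changes (assumed rule)."""
    documents: List[Document] = []
    for index, text in enumerate(page_texts):
        doc_type = classify(text)
        if not documents or documents[-1].doc_type != doc_type:
            documents.append(Document(doc_type))  # virtual boundary before this page
        documents[-1].page_indices.append(index)
    return documents


# Made-up OCR output, one string per scanned page, with no separator sheets.
ocr_pages = [
    "Uniform Residential Appraisal Report",
    "Appraisal, continued: comparable sales",
    "Form 1040 U.S. Individual Income Tax Return",
]
for doc in separate(ocr_pages):
    print(doc.doc_type, doc.page_indices)
```

A boundary rule this simple cannot split two consecutive documents of the same type, which is one reason a dedicated separation mechanism is needed.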
8.2 Related Work
Traditionally, the processing of scanned paper forms has concentrated on the han-
dling of structured forms. These are paper documents that have well-defined physical
areas in which to insert information, such as the social security number and income
information on tax forms. Ideally, for these forms the separation problem does not
even arise, since the documents are of a specified length. If, however, a sequence of
documents needs to be separated, it is usually enough to concentrate a recognition
process on the first page to find out which document is present. This information
defines the number of pages in the document and thus the separation information
with certainty. The recognition process is often done indirectly, coupled with a sub-
sequent extraction system. Extraction rules define areas of interest on a form and
how to gather data from those zones. For instance, an extraction rule for a tax form
could first specify how to identify a box on the top left corner of the document that
contains the text “1040.” Then the rule would search down the form to find a rect-
angular box labeled “SSN” and extract the nine digits contained in a grid directly
to the right of the label. If this recognition rule succeeds, i.e., text can be found and
recognized with sufficient confidence, the document is identified as a two-page 1040
tax form and the social security number is extracted.
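The rule described above, together with separation by known page count, might look roughly like the following sketch. The Word layout (coordinates as page fractions), the position thresholds, the PAGE_COUNTS table, and the sample data are hypothetical; recognition confidence is left out for brevity.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    x: float  # left edge as a fraction of page width (0 = left)
    y: float  # top edge as a fraction of page height (0 = top)


# Assumed lookup of form type to page count; the text names the 1040 as two pages.
PAGE_COUNTS = {"1040": 2}


def apply_1040_rule(words: List[Word]) -> Optional[str]:
    """Return the SSN if the page matches the 1040 rule, else None."""
    # Step 1: the form number must appear in the top-left corner (thresholds assumed).
    if not any(w.text == "1040" and w.x < 0.25 and w.y < 0.15 for w in words):
        return None
    # Step 2: find the "SSN" label on the page.
    label = next((w for w in words if w.text == "SSN"), None)
    if label is None:
        return None
    # Step 3: collect the digits directly to the right of the label, on the same row.
    row = sorted((w for w in words if abs(w.y - label.y) < 0.01 and w.x > label.x),
                 key=lambda w: w.x)
    digits = re.sub(r"\D", "", "".join(w.text for w in row))
    return digits if len(digits) == 9 else None


def separate_fixed_length(pages: List[List[Word]]) -> List[Tuple[str, int, int, str]]:
    """Return (form_type, first_page, last_page, ssn) for each recognized form."""
    found = []
    i = 0
    while i < len(pages):
        ssn = apply_1040_rule(pages[i])
        if ssn is None:
            i += 1  # unrecognized page; a real system would need to handle it
            continue
        length = PAGE_COUNTS["1040"]
        last = min(i + length, len(pages)) - 1
        found.append(("1040", i, last, ssn))
        i += length  # the first page fixes the boundary of the whole form
    return found


# Made-up pages: the digits are a dummy value, not real data.
first_page = [Word("1040", 0.05, 0.04), Word("SSN", 0.60, 0.20),
              Word("123", 0.66, 0.20), Word("45", 0.72, 0.20), Word("6789", 0.76, 0.20)]
second_page = [Word("Signature", 0.10, 0.90)]
print(separate_fixed_length([first_page, second_page, first_page, second_page]))
```

Because the page count follows from the form type, only the first page of each form is inspected; the remaining pages are skipped without any recognition.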
While such local, forms-based rules work extremely well in their area of appli-
cation, the extension of this approach to less structured forms or even forms that
exist in a large number of variations is highly effort-intensive and error-prone. For
instance, the example above treated federal tax forms, of which there are only a few
varieties. However, there are at least fifty varieties of state tax forms, and defining
Footnote 2: We do not discuss distributed scanning operations here. The principle in this
case is that no paper documents are ever shipped, but that each local office scans
the documents that are created locally. The images of the documents are then
transferred to a central facility. This operational schema presents some of the
same and some additional complications.