Automatic Document Separation: A Combination of Probabilistic Classification and Finite-State Sequence Modeling - Natural Language Processing and Text Mining


existed is re-created. It is challenging, since the person needs to have a fair amount of knowledge of loan documents (hundreds of document categories) and work with a high degree of attention to detail. Nevertheless, the error rate for this process can be as high as 8%. The cost for the insertion is also significant, both in terms of labor and material; it is estimated that 50% of the document preparation cost is used for sorting and the insertion of separator sheets. One customer estimates the printing cost for separator sheets alone to be in excess of $1M per year.

In the automated solution presented here [1], the loan files still need to be collected and shipped to a central facility for processing.² At the facility, the batches are scanned in their entirety, without inserting separator sheets beforehand. The result of this process is a long sequence of images of pages, up to 2000 images per batch. Next, the text on each page is read by an OCR engine. A classification engine (see Section 8.4.2) determines the likely document types of loan documents (e.g., Appraisals, Tax Forms, etc.), and a separation mechanism (see Section 8.4.3) inserts virtual boundaries between pages to indicate where one document ends and the next one begins. The separated documents are then labeled accordingly and delivered for further processing, e.g., the extraction of relevant information.
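
The idea of virtual boundaries between pages can be illustrated with a deliberately naive sketch. It assumes that every page has already been assigned exactly one document type, and it places a boundary wherever the type changes between adjacent pages. The function name `separate_pages` and the sample labels are hypothetical. This is not the separation mechanism of Section 8.4.3; in particular, this simplification cannot split two consecutive documents of the same type.

```python
from itertools import groupby
from typing import List, Tuple


def separate_pages(page_labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group consecutive pages that share a label into documents.

    Returns (label, first_page, last_page) tuples with 0-based,
    inclusive page indices. Assumption: one label per page is given.
    """
    documents = []
    start = 0
    for label, run in groupby(page_labels):
        length = sum(1 for _ in run)  # number of pages in this run
        documents.append((label, start, start + length - 1))
        start += length
    return documents


if __name__ == "__main__":
    # Hypothetical per-page labels for a short batch.
    labels = ["Appraisal", "Appraisal", "Tax Form", "Tax Form", "Appraisal"]
    for document in separate_pages(labels):
        print(document)
```

Each returned tuple corresponds to one separated document; the gaps between tuples are the virtual boundaries.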

8.2 Related Work

Traditionally, the processing of scanned paper forms has concentrated on the handling of structured forms. These are paper documents that have well-defined physical areas in which to insert information, such as the social security number and income information on tax forms. Ideally, for these forms the separation problem does not even arise, since the documents are of a specified length. If, however, a sequence of documents needs to be separated, it is usually enough to concentrate a recognition process on the first page to find out which document is present. This information defines the number of pages in the document and thus the separation information with certainty. The recognition process is often done indirectly, coupled with a subsequent extraction system. Extraction rules define areas of interest on a form and how to gather data from those zones. For instance, an extraction rule for a tax form could first specify how to identify a box on the top left corner of the document that contains the text “1040.” Then the rule would search down the form to find a rectangular box labeled “SSN” and extract the nine digits contained in a grid directly to the right of the label. If this recognition rule succeeds, i.e., text can be found and recognized with sufficient confidence, the document is identified as a two-page 1040 tax form and the social security number is extracted.
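
A rule of this kind, together with the fixed-length separation it enables, might look as follows. This is a minimal sketch that works on plain OCR text per page rather than on physical zones, and it ignores OCR confidence scores. The regular expressions, the `PAGE_COUNTS` table, and the function names are illustrative assumptions; only the two-page length of the 1040 form comes from the text above.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup table: form type -> fixed number of pages.
PAGE_COUNTS: Dict[str, int] = {"1040": 2}

# Assumed layout: the label "SSN" followed closely by nine digits,
# optionally grouped 3-2-4 with a single separator character.
SSN_PATTERN = re.compile(r"SSN\D{0,10}(\d{3})\D?(\d{2})\D?(\d{4})")


def recognize_first_page(text: str) -> Optional[Tuple[str, str]]:
    """Return (form_type, ssn) if the rule succeeds on this page, else None."""
    if not re.search(r"\b1040\b", text):
        return None
    match = SSN_PATTERN.search(text)
    if match is None:
        return None
    return "1040", "".join(match.groups())


def separate_structured(pages: List[str]) -> List[Tuple[str, str, int, int]]:
    """Separate a page sequence using first-page recognition only.

    Returns (form_type, ssn, first_page, last_page) tuples, 0-based inclusive.
    Pages on which the rule fails are skipped.
    """
    documents = []
    i = 0
    while i < len(pages):
        result = recognize_first_page(pages[i])
        if result is None:
            i += 1
            continue
        form_type, ssn = result
        # The known page count fixes the end of the document.
        last = min(i + PAGE_COUNTS[form_type], len(pages)) - 1
        documents.append((form_type, ssn, i, last))
        i = last + 1
    return documents
```

Because the page count is looked up from the recognized form type, the second page of a form is never inspected, which mirrors the point that recognizing the first page is usually enough for structured forms.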

While such local, forms-based rules work extremely well in their area of application, the extension of this approach to less structured forms or even forms that exist in a large number of variations is highly effort-intensive and error-prone. For instance, the example above treated federal tax forms, of which there are only a few varieties. However, there are at least fifty varieties of state tax forms, and defining

² We do not discuss distributed scanning operations here. The principle in this case is that no paper documents are ever shipped, but that each local office scans the documents that are created locally. The images of the documents are then transferred to a central facility. This operational schema presents some of the same and some additional complications.
