Automatic Document Separation: A Combination of Probabilistic Classification and Finite-State Sequence Modeling - Natural Language Processing and Text Mining


8

Automatic Document Separation: A Combination of Probabilistic Classification and Finite-State Sequence Modeling

Mauritius A. R. Schmidtler and Jan W. Amtrup

8.1 Introduction

Large organizations are increasingly confronted with the problem of capturing, processing, and archiving large amounts of data. For several reasons, the problem is especially cumbersome when data is stored on paper. First, the weight, volume, and relative fragility of paper incur problems in handling and require specific, labor-intensive processes to be applied. Second, for automatic processing, the information contained on the pages must be digitized by performing Optical Character Recognition (OCR). This leads to a certain number of errors in the data retrieved from paper. Third, the identities of individual documents become blurred. In a stack of paper, the boundaries between documents are lost, or at least obscured to a large degree. 1

As an example, consider the processing of loan documents in the mortgage industry: Usually, documents originate at local branch offices of an organization (e.g., bank branches, where a customer fills out and signs the necessary forms and provides additional information). All loan documents finalized at a local office on a given day are collated into one stack of paper (called a batch) and sent via surface mail to a centralized processing facility. At that facility, the arriving packets from all over the country are opened and the batches are scanned. In order to define the boundaries and identities of documents, separator sheets are manually inserted in the batches. Separator sheets are special pages that carry a barcode identifying the specific loan document that follows the sheet, e.g., Final Loan Application or Tax Form. The separation and identification of the documents is necessary for archival and future retrieval of specific documents. It is also a precondition for further processing, for instance in order to facilitate the extraction of certain key information, e.g., the loan number, property address, and the like.

The problem we are addressing in this chapter is the process of manually inserting separator sheets into loan files. A person must take a loan file, leaf through the stack of paper (hundreds of pages), and insert appropriate separators at the correct boundary points. This work is both tedious and challenging. It is tedious, since no important new information is created, but only information that previously

1 Notwithstanding physical markers such as staples, etc. Those are usually removed as a first step in document processing.
