8
Automatic Document Separation:
A Combination of Probabilistic Classification
and Finite-State Sequence Modeling
Mauritius A. R. Schmidtler and Jan W. Amtrup
8.1 Introduction
Large organizations are increasingly confronted with the problem of capturing,
processing, and archiving large amounts of data. For several reasons, the problem
is especially cumbersome when the data is stored on paper. First, the weight,
volume, and relative fragility of paper cause handling problems and require
specific, labor-intensive processes. Second, for automatic processing, the
information contained on the pages must be digitized and passed through Optical
Character Recognition (OCR), which introduces a certain number of errors into the
data retrieved from paper. Third, the identities of individual documents become
blurred. In a stack of paper, the boundaries between documents are lost, or at
least obscured to a large degree. 1
As an example, consider the processing of loan documents in the mortgage
industry. Usually, documents originate at local branch offices of an organization
(e.g., bank branches, when a customer fills out and signs the necessary forms and
provides additional information). All loan documents finalized at a local office
on a given day are collated into one stack of paper (called a batch) and sent via
surface mail to a centralized processing facility. At that facility, the packets
arriving from all over the country are opened and the batches are scanned. In
order to define the boundaries and identities of documents, separator sheets are
manually inserted into the batches. Separator sheets are special pages that carry
a barcode identifying the specific loan document that follows the sheet, e.g.,
Final Loan Application or Tax Form. The separation and identification of the
documents is necessary for archival and future retrieval of specific documents.
It is also a precondition for further processing, for instance the extraction of
certain key information such as the loan number or property address.
The problem we are addressing in this chapter is the process of manually
inserting separator sheets into loan files. A person must take a loan file, leaf through
the stack of paper (hundreds of pages), and insert appropriate separators at the
correct boundary points. This work is both tedious and challenging. It is tedious,
since no important new information is created, but only information that previously
1 Notwithstanding physical markers such as staples. Those are usually removed
as a first step in document processing.
 