Using an automated classification and separation solution yields significant benefits for an organization. There are large cost savings associated with the process (in a manual solution, 50% of the preparation cost is spent on sorting and inserting separator sheets), and the accuracy is superior. In a typical setting with hundreds of document types, a precision of at least 95-98% can be attained at a recall level of at least 80%. This means that only a fraction of the original data must be reviewed and no operations have to be performed on the physical paper pages at the source of the process.
8.6 Conclusion
In this chapter, we presented an automatic solution for the classification and separation of paper documents. The problem is to ingest a long sequence of images of paper pages and to convert them into a sequence of documents with definite boundaries and document types. In a manual setting, this process is costly and error-prone.
The automatic solution we describe prepares the incoming pages by running them through an OCR process to discover the text on each page. Basic NLP techniques for segmentation and morphological processing are used to arrive at a description of a page that associates stems with occurrence counts (a bag-of-words model). An SVM classifier is applied to generate probabilities that pages are of a given document and page type. After obtaining all classification probabilities, we use a finite-state transducer-based approach to detect likely boundaries between documents. Viewing this process as a sequence-mapping problem with well-defined subareas such as probabilistic modeling, classification, and sequence processing allows us to fine-tune several aspects of the approach.
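The boundary-detection step can be illustrated with a small sketch. The two states, transition weights, and classifier outputs below are illustrative assumptions, not the chapter's actual model; the point is only to show how per-page classification probabilities are decoded into a globally consistent label sequence, in the spirit of the finite-state transducer approach.

```python
import math

STATES = ("FIRST", "CONT")  # first page of a document vs. continuation page

# Transition log-probabilities: a mild prior for staying inside a document.
# These weights are assumptions chosen for the example.
TRANS = {
    ("FIRST", "FIRST"): math.log(0.3), ("FIRST", "CONT"): math.log(0.7),
    ("CONT", "FIRST"): math.log(0.3), ("CONT", "CONT"): math.log(0.7),
}

def separate(first_page_probs):
    """Viterbi decoding: most likely FIRST/CONT labeling of a page stream."""
    emit = [{"FIRST": math.log(p), "CONT": math.log(1.0 - p)}
            for p in first_page_probs]
    # The stream must open with the first page of some document.
    score = {"FIRST": emit[0]["FIRST"], "CONT": float("-inf")}
    back = []
    for e in emit[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda q: score[q] + TRANS[(q, s)])
            new_score[s] = score[prev] + TRANS[(prev, s)] + e[s]
            ptr[s] = prev
        back.append(ptr)
        score = new_score
    # Backtrace from the best-scoring final state.
    best = max(STATES, key=score.get)
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical classifier outputs P(page starts a new document) for six pages:
probs = [0.95, 0.10, 0.05, 0.90, 0.20, 0.85]
print(separate(probs))  # -> ['FIRST', 'CONT', 'CONT', 'FIRST', 'CONT', 'FIRST']
```

Pages labeled `FIRST` mark document boundaries; unlike thresholding each page independently, the decoder weighs every page's evidence against the transition prior over the whole sequence.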
There were several major challenges in the development of this set of algorithms. The outside constraints prescribed a solution with high performance, both in terms of process accuracy and resource efficiency (time and hardware in setup and production). These requirements have significant ramifications for the choice of algorithms and models. For instance, Bayesian classifiers based on word n-grams are largely unsuited due to their high training-data demands. Also, the composition and search during separation had to be implemented in an on-demand fashion to comply with memory size requirements.
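The on-demand idea can be sketched in a few lines: instead of materializing the full composed search space, successor states are generated only when a state is actually expanded. The search problem and costs below are toy assumptions for illustration, not the chapter's implementation.

```python
import heapq

def lazy_best_first(start, is_goal, successors):
    """Best-first search that expands states lazily via `successors`,
    so the full search graph is never held in memory at once."""
    frontier = [(0.0, start)]
    seen = set()
    while frontier:
        cost, state = heapq.heappop(frontier)
        if state in seen:
            continue
        seen.add(state)
        if is_goal(state):
            return cost, state
        for step_cost, nxt in successors(state):
            if nxt not in seen:
                heapq.heappush(frontier, (cost + step_cost, nxt))
    return None

# Toy problem: reach page index 5 from 0, advancing one page (cost 1.0)
# or two pages (cost 2.0) at a time; successors are computed on demand.
result = lazy_best_first(
    start=0,
    is_goal=lambda s: s == 5,
    successors=lambda s: [(1.0, s + 1), (2.0, s + 2)] if s < 5 else [],
)
print(result)  # -> (5.0, 5)
```

Only the frontier and the visited set are kept in memory; the `successors` callback plays the role of the lazily composed transducer.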
The overall result is a system that — although relatively simple in its basic
components and methods — is very complex in its totality and its optimizations
on a component level. We consistently reach high performance of greater than 95% precision with more than 80% recall, and the solution described here is used in large deployments with a throughput of several million pages a month.
8.7 Acknowledgments
Developing and validating technology solutions that can eventually be turned into
successful products in the marketplace is an endeavor that includes many people.
The authors would like to thank all who participated in exploring the technological
and engineering problems of automatic document separation, in particular Tristan
Juricek, Scott Texeira, and Sameer Samat.