Using an automated classification and separation solution yields significant benefits for an organization. There are large cost savings associated with the process (in a manual solution, 50% of the preparation cost is spent on sorting and inserting separator sheets), and the accuracy is superior. In a typical setting with hundreds of document types, a precision of at least 95-98% can be attained at a recall level of at least 80%. This means that only a fraction of the original data must be reviewed and no operations have to be performed on the physical paper pages at the source of the process.
8.6 Conclusion
In this chapter, we presented an automatic solution for the classification and separation of paper documents. The problem is to ingest a long sequence of images of paper pages and to convert them into a sequence of documents with definite boundaries and document types. In a manual setting, this process is costly and error-prone.
The automatic solution we describe prepares the incoming pages by running them through an OCR process to discover the text on each page. Basic NLP techniques for segmentation and morphological processing are used to arrive at a description of a page that associates stems with occurrence counts (a bag-of-words model). An SVM classifier is applied to generate probabilities that pages are of a given document and page type. After obtaining all classification probabilities, we use a finite-state transducer-based approach to detect likely boundaries between documents. Viewing this process as a sequence-mapping problem with well-defined subareas such as probabilistic modeling, classification, and sequence processing allows us to fine-tune several aspects of the approach.
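The boundary-detection step can be illustrated with a small sketch. The two states, transition weights, and classifier outputs below are illustrative assumptions, not the chapter's actual model; the point is only to show how per-page classification probabilities are decoded into a globally consistent label sequence, in the spirit of the finite-state transducer approach.

```python
import math

STATES = ("FIRST", "CONT")  # first page of a document vs. continuation page

# Transition log-probabilities: a mild prior for staying inside a document.
# These weights are assumptions chosen for the example.
TRANS = {
    ("FIRST", "FIRST"): math.log(0.3), ("FIRST", "CONT"): math.log(0.7),
    ("CONT", "FIRST"): math.log(0.3), ("CONT", "CONT"): math.log(0.7),
}

def separate(first_page_probs):
    """Viterbi decoding: most likely FIRST/CONT labeling of a page stream."""
    emit = [{"FIRST": math.log(p), "CONT": math.log(1.0 - p)}
            for p in first_page_probs]
    # The stream must open with the first page of some document.
    score = {"FIRST": emit[0]["FIRST"], "CONT": float("-inf")}
    back = []
    for e in emit[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda q: score[q] + TRANS[(q, s)])
            new_score[s] = score[prev] + TRANS[(prev, s)] + e[s]
            ptr[s] = prev
        back.append(ptr)
        score = new_score
    # Backtrace from the best-scoring final state.
    best = max(STATES, key=score.get)
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical classifier outputs P(page starts a new document) for six pages:
probs = [0.95, 0.10, 0.05, 0.90, 0.20, 0.85]
print(separate(probs))  # -> ['FIRST', 'CONT', 'CONT', 'FIRST', 'CONT', 'FIRST']
```

Pages labeled `FIRST` mark document boundaries; unlike thresholding each page independently, the decoder weighs every page's evidence against the transition prior over the whole sequence.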
There were several major challenges in the development of this set of algorithms. The outside constraints prescribed a solution with high performance, both in terms of process accuracy and resource efficiency (time and hardware in setup and production). These requirements have significant ramifications for the choice of algorithms and models. For instance, Bayesian classifiers based on word n-grams are largely unsuited due to their high training-data demands. Also, the composition and search during separation had to be implemented in an on-demand fashion to comply with memory size requirements.
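The on-demand idea can be sketched in a few lines: instead of materializing the full composed search space, successor states are generated only when a state is actually expanded. The search problem and costs below are toy assumptions for illustration, not the chapter's implementation.

```python
import heapq

def lazy_best_first(start, is_goal, successors):
    """Best-first search that expands states lazily via `successors`,
    so the full search graph is never held in memory at once."""
    frontier = [(0.0, start)]
    seen = set()
    while frontier:
        cost, state = heapq.heappop(frontier)
        if state in seen:
            continue
        seen.add(state)
        if is_goal(state):
            return cost, state
        for step_cost, nxt in successors(state):
            if nxt not in seen:
                heapq.heappush(frontier, (cost + step_cost, nxt))
    return None

# Toy problem: reach page index 5 from 0, advancing one page (cost 1.0)
# or two pages (cost 2.0) at a time; successors are computed on demand.
result = lazy_best_first(
    start=0,
    is_goal=lambda s: s == 5,
    successors=lambda s: [(1.0, s + 1), (2.0, s + 2)] if s < 5 else [],
)
print(result)  # -> (5.0, 5)
```

Only the frontier and the visited set are kept in memory; the `successors` callback plays the role of the lazily composed transducer.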
The overall result is a system that — although relatively simple in its basic
components and methods — is very complex in its totality and its optimizations
on a component level. We consistently reach high performance of greater than 95% precision with more than 80% recall, and the solution described here is used in large deployments with a throughput of several million pages a month.
8.7 Acknowledgments
Developing and validating technology solutions that can eventually be turned into
successful products in the marketplace is an endeavor that includes many people.
The authors would like to thank all who participated in exploring the technological
and engineering problems of automatic document separation, in particular Tristan
Juricek, Scott Texeira, and Sameer Samat.