Automatic Document Separation: A Combination of Probabilistic Classification and Finite-State Sequence Modeling - Natural Language Processing and Text Mining

Training this model required at least 20 examples per category: 10 for training

and 10 for a hold-out set. The maximum number of examples per category

was capped at 40. Initially, the feature space had a dimensionality of 620,455. We

reduced this number to at most 20,000 features per category by applying mutual

information feature selection.
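As an illustration of per-category mutual information feature selection, the following is a minimal sketch, not the authors' implementation. It assumes the common formulation that scores each term by the mutual information between its presence in a document and the document's membership in one category; the text does not state which variant was used. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term presence and category membership.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each tuple: (joint count, marginal count for the term value,
    #              marginal count for the category value).
    for n_joint, n_term, n_cat in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if n_joint > 0:  # empty cells contribute nothing to the sum
            mi += (n_joint / n) * math.log2(n * n_joint / (n_term * n_cat))
    return mi


def select_features(docs, labels, category, k):
    """Return the k terms with the highest mutual information for one category."""
    n_docs = len(docs)
    n_in_cat = sum(1 for y in labels if y == category)
    df_all = Counter()  # document frequency over all documents
    df_cat = Counter()  # document frequency inside the category
    for words, y in zip(docs, labels):
        terms = set(words)
        df_all.update(terms)
        if y == category:
            df_cat.update(terms)
    scores = {}
    for term, df in df_all.items():
        n11 = df_cat[term]
        n10 = df - n11
        n01 = n_in_cat - n11
        n00 = n_docs - n_in_cat - n10
        scores[term] = mutual_information(n11, n10, n01, n00)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["invoice", "total", "amount"],
    ["invoice", "total", "due"],
    ["letter", "dear", "sincerely"],
    ["letter", "dear", "regards"],
]
labels = ["invoice", "invoice", "letter", "letter"]
top_terms = select_features(docs, labels, "invoice", k=4)
```

Note that mutual information is symmetric in this formulation: a term that reliably signals the absence of a category scores as high as one that signals its presence. In the setting described above, `k` would correspond to the cap of 20,000 features per category.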

A comparison between the different problems and the models we apply is instructive.

In Table 8.3, we reach an F1-value of 95% for the classification of documents.

There, the boundaries are given, and the classifier is able to use all words from all

pages in the document. In the experiments we report in Table 8.5, the problem is

more complex: Each page must be classified separately and document boundaries

inferred. Applying a comparable model (Eq. 8.2) in this situation, we reach an

F1-value of only 63%. Only by carefully selecting an appropriate probability model

are we able to raise the performance to an F1-value of 92% on the page level with

model (Eq. 8.6).

One should note that the scores delivered by the SVM multi-class classifier are

calibrated and represent class membership probabilities. Thus, thresholding can

be applied to control the number of errors that a customer expects from automatic

decisions, and to control the amount of manual review of decisions that have been

rejected. Using this technique, we can achieve a precision above 95% while simultaneously

keeping the recall above 80%.
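The trade-off that thresholding controls can be sketched as follows. This is a minimal illustration under stated assumptions, not the deployed system: the class probabilities are made-up values, the function names are hypothetical, and recall is taken here to mean the fraction of all items that are decided automatically and correctly.

```python
def decide(probabilities, threshold):
    """Return the most probable class if it clears the threshold, else None.

    None stands for a rejected decision that is routed to manual review.
    """
    label = max(probabilities, key=probabilities.get)
    return label if probabilities[label] >= threshold else None


def precision_recall(predictions, truths):
    """Precision over accepted decisions, recall over all items."""
    accepted = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    correct = sum(1 for p, t in accepted if p == t)
    precision = correct / len(accepted) if accepted else 1.0
    recall = correct / len(truths) if truths else 0.0
    return precision, recall


# Hypothetical calibrated outputs for four items and their true classes.
scores = [
    {"invoice": 0.97, "letter": 0.03},
    {"invoice": 0.55, "letter": 0.45},
    {"invoice": 0.10, "letter": 0.90},
    {"invoice": 0.60, "letter": 0.40},
]
truths = ["invoice", "letter", "letter", "invoice"]

# A strict threshold rejects the two uncertain items.
strict = precision_recall([decide(s, 0.8) for s in scores], truths)
# A lenient threshold accepts everything, including one wrong decision.
lenient = precision_recall([decide(s, 0.5) for s in scores], truths)
```

On this toy data the strict threshold accepts only the two confident, correct decisions, while the lenient threshold accepts all four and lets one error through. Raising the threshold therefore trades automation (recall) for fewer errors (precision), with rejected items going to manual review.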

8.5.1 Production Deployments

The deployment of an automatic document separation solution is a lengthy process,

as is common for any workflow-changing installation in large organizations. Most

often, a proof-of-concept phase precedes the deployment proper. This part of a

project can be pre-sales in order to demonstrate the feasibility of the approach to

the customer, or it can be the first step in a deployment to find out how much

automation can be introduced with high accuracy. In a proof of concept (POC), only

a small subset of document types is considered for classification and separation.

This poses a set of unique problems to consider: The document separator is normally

set up to classify all documents into a set of well-known and well-defined document

types. In a POC, only a subset of document types (say, 10 out of 50) is relevant.

However, the incoming batches still contain documents of all types. The challenge

here is to “actively ignore” the remaining document types without adverse effects

on the classification and separation results for the document types on which we are

concentrating.

The deployment of the production version of the separation solution can take

as long as six months for a medium-sized organization (separating between five and

ten million pages per month). Of this time, two to three months are usually spent

on configuring the software. This includes the setup of the training data for the

separator but also the development of an extraction mechanism that is usually part

of the larger workflow. The rest of the time is used for the purchase and installation of

hardware (possibly new scanners, processing machines, and review stations) and the

retraining of the review personnel. It is good practice to introduce the new workflow

and automated solution in increments, first converting one or two production lines to

the automated separation solution and reviewing the efficiency of the process. Once

the hardware, software and workflow function satisfactorily, the remaining lines are

activated.
