Training this model required at least 20 examples per category: 10 for training
and 10 for a hold-out set. The maximum number of examples per category
was capped at 40. Initially, the feature space had a dimensionality of 620,455. We
reduced this number to at most 20,000 features per category by applying mutual
information feature selection.
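The selection step can be illustrated with a minimal sketch. It assumes binary term presence per document and a one-versus-rest view of each category; the text does not state which variant of mutual information was used, so the function names, the data layout, and the tie-breaking rule below are all hypothetical.

```python
import math
from collections import Counter

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) from a 2x2 term/category contingency table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each tuple: (joint count, marginal count for the term state,
    #              marginal count for the category state)
    for n_tc, n_t, n_c in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if n_tc > 0:  # empty cells contribute nothing
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

def select_features(docs, labels, category, k):
    """Return the k terms with the highest mutual information for one category.

    docs:   list of sets of terms (assumed representation)
    labels: list of category names, aligned with docs
    """
    n_docs = len(docs)
    n_pos = sum(1 for y in labels if y == category)
    df_all, df_pos = Counter(), Counter()
    for terms, y in zip(docs, labels):
        df_all.update(terms)
        if y == category:
            df_pos.update(terms)
    scores = {}
    for term, df in df_all.items():
        n11 = df_pos[term]
        n10 = df - n11
        n01 = n_pos - n11
        n00 = n_docs - n_pos - n10
        scores[term] = mutual_information(n11, n10, n01, n00)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

In the setting described above, `k` would be 20,000 and `select_features` would be called once per category.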
A comparison between the different problems and the models we apply is instructive.
In Table 8.3, we reach an F1-value of 95% for the classification of documents.
There, the boundaries are given, and the classifier is able to use all words from all
pages in the document. In the experiments we report in Table 8.5, the problem is
more complex: Each page must be classified separately and document boundaries
inferred. Applying a comparable model (Eq. 8.2) in this situation, we reach an
F1-value of only 63%. Only by carefully selecting an appropriate probability model
are we able to raise the performance to an F1-value of 92% on the page level with
model (Eq. 8.6).
One should note that the scores delivered by the SVM multi-class classifier are
calibrated and represent class membership probabilities. Thus, thresholding can
be applied to control the number of errors that a customer expects from automatic
decisions, and to control the amount of manual review of decisions that have been
rejected. Using this technique, we can achieve a precision above 95% while
simultaneously keeping the recall above 80%.
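The thresholding idea can be sketched as follows. This is an illustration under stated assumptions, not the deployed implementation: each prediction is assumed to be a dictionary of calibrated class probabilities, a decision is accepted automatically only if its top probability reaches the threshold, and recall is taken to mean the fraction of all items that receive a correct automatic decision. The names `route` and `evaluate` are hypothetical.

```python
def route(prob_by_class, threshold):
    """Accept the top class if its probability reaches the threshold.

    Returns (label, probability); label is None when the decision is
    rejected and should go to manual review.
    """
    label, p = max(prob_by_class.items(), key=lambda kv: kv[1])
    return (label, p) if p >= threshold else (None, p)

def evaluate(predictions, truths, threshold):
    """Precision over accepted decisions and recall over all items."""
    accepted = correct = 0
    for probs, truth in zip(predictions, truths):
        label, _ = route(probs, threshold)
        if label is not None:
            accepted += 1
            correct += (label == truth)
    # With nothing accepted, no automatic errors were made.
    precision = correct / accepted if accepted else 1.0
    recall = correct / len(truths) if truths else 0.0
    return precision, recall
```

Raising the threshold sends more decisions to manual review, which tends to increase precision at the cost of recall; sweeping the threshold over a hold-out set shows which operating points are reachable.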
8.5.1 Production Deployments
The deployment of an automatic document separation solution is a lengthy process,
as is common for any workflow-changing installation in large organizations. Most
often, a proof-of-concept phase precedes the deployment proper. This part of a
project can be a pre-sales activity that demonstrates the feasibility of the approach
to the customer, or it can be the first step in a deployment, used to find out how
much automation can be introduced with high accuracy. In a proof of concept (POC),
only a small subset of document types is considered for classification and separation.
This poses a set of unique problems to consider: The document separator is normally
set up to classify all documents into a set of well-known and well-defined document
types. In a POC, only a subset of document types (say, 10 out of 50) is relevant.
However, the incoming batches still contain documents of all types. The challenge
here is to “actively ignore” the remaining document types without adverse effects
on the classification and separation results for the document types on which we are
concentrating.
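The text does not say how the remaining document types are ignored. One possible approach, shown here purely as an assumption, is to collapse all out-of-scope types into a single catch-all class during training and to discard pages assigned to that class afterwards. The label `OTHER` and both function names are hypothetical.

```python
def relabel_for_poc(labels, in_scope, other_label="OTHER"):
    """Map every out-of-scope document type to a single catch-all label."""
    return [y if y in in_scope else other_label for y in labels]

def keep_in_scope(decisions, other_label="OTHER"):
    """Drop pages that the classifier assigned to the catch-all class.

    decisions: list of (page_id, predicted_label) pairs (assumed layout)
    """
    return [(page, y) for page, y in decisions if y != other_label]
```

With this layout, the separator still sees examples of every incoming document type, so out-of-scope pages are less likely to be forced into one of the ten types under study.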
The deployment of the production version of the separation solution can take
as long as six months for a medium-sized organization (separating between five and
ten million pages per month). Of this time, two to three months are usually spent
on configuring the software. This includes the setup of the training data for the
separator but also the development of an extraction mechanism that is usually part
of the larger workflow. The rest of the time is used for the purchase and installation of
hardware (possibly new scanners, processing machines, and review stations) and the
retraining of the review personnel. It is good practice to introduce the new workflow
and automated solution in increments, first converting one or two production lines to
the automated separation solution and reviewing the efficiency of the process. Once
the hardware, software and workflow function satisfactorily, the remaining lines are
activated.