Table 8.5. Comparison of separation and classification results of the various sequence models.

    Sequence model                                     Equation   Micro-averaged F1-value
    p(d_j | p_j)                                       Eq. 8.2    0.63
    p(d_j | d_{j-1}, p_j)                              Eq. 8.3    0.74
    p(d_j | d_{j-1}, d_{j-2}, p_j)                     Eq. 8.4    0.83
    p(d_j | p_j)                                       Eq. 8.6    0.84
    p(d_j | d_{j-1}, d_{j-2}, p_{j-1}, p_j, p_{j+1})   Eq. 8.7    0.86
    p(d_j | p_{j-1}, p_j, p_{j+1})                     Eq. 8.8    0.87
The inclusion of a history of document types improves performance. This is not
surprising, given the fact that forms are, on average, longer than one page. For
instance, using a trigram model instead of a unigram yields an improvement of
31%.
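As a rough illustration of how a history of previous decisions can enter the model at decoding time, the following is a minimal greedy decoder that conditions each page's decision on the two previously predicted document types, as in the trigram model of Eq. 8.4. The `score` function, the label set, and the greedy strategy are assumptions for illustration, not the authors' implementation:

```python
def decode(pages, labels, score):
    """Greedily assign a document type to each page, conditioning on
    the two previously predicted types (a trigram history).

    `score(page, prev1, prev2, label)` is an assumed classifier that
    returns a probability-like value for p(d_j | d_{j-1}, d_{j-2}, p_j).
    """
    history = [None, None]  # d_{j-1} and d_{j-2} for the first page
    decoded = []
    for page in pages:
        best = max(labels,
                   key=lambda d: score(page, history[-1], history[-2], d))
        decoded.append(best)
        history.append(best)
    return decoded
```

A full system would more likely use Viterbi-style decoding over the whole page sequence; the greedy loop above only shows where the history enters the scoring.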
Specializing page descriptions improves performance. This confirms our earlier
reasoning that forms often exhibit specific start and end pages. It also allows
the model to separate two consecutive instances of the same document type.
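The separation of two consecutive instances of the same type can be sketched as a post-processing step over specialized page labels. Here each page is assumed to carry a label such as `invoice-start` or `invoice-rest` (a hypothetical naming convention): a start label always opens a new document, so two back-to-back forms of the same type are still split correctly:

```python
def segment(subtypes):
    """Turn a per-page sequence of specialized labels like 'invoice-start'
    or 'invoice-rest' into (document_type, start_index, end_index) spans.

    A '-start' label always opens a new document, which is what allows
    two consecutive instances of the same form to be separated.
    """
    docs = []
    for i, label in enumerate(subtypes):
        doc_type, _, part = label.partition("-")
        if part == "start" or not docs or docs[-1][0] != doc_type:
            docs.append([doc_type, i, i])   # open a new document at page i
        else:
            docs[-1][2] = i                 # extend the current document
    return [tuple(d) for d in docs]
```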
Conditioning on the content of surrounding pages improves performance. Com-
paring the last two rows in Table 8.5 with their counterparts without the content
of the surrounding pages in the condition indicates a boost of around 3.5% in
F1-value.
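One way to realize such conditioning is to pool features from a three-page window, keeping the three sources apart with prefixes; this is also where the roughly threefold increase in feature count discussed next comes from. A minimal sketch, with whitespace tokenization as a simplifying assumption:

```python
def window_features(page_texts, j):
    """Bag-of-words features for page j combined with the content of its
    two neighbours (as in Eq. 8.8); prefixes keep the sources apart.
    """
    feats = {}
    for offset, prefix in ((-1, "prev:"), (0, "cur:"), (1, "next:")):
        k = j + offset
        if 0 <= k < len(page_texts):      # first/last page have one neighbour
            for tok in page_texts[k].split():
                feats[prefix + tok] = feats.get(prefix + tok, 0) + 1
    return feats
```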
The last model (Eq. 8.8) is the best model in our experiments. However, it
presents a serious drawback in that it uses roughly three times the number of fea-
tures to describe a page (namely, the content of the page itself and that of the two
surrounding pages). Given the increased CPU and memory usage during training,
this seemed too high a price to pay for a 3% gain in performance. Thus, for deploy-
ment into customer production systems, we decided to use the model according to
Eq. 8.6. It is the best of the one-page-content models, and the distinction of page
types not only makes the model more efficient, but also helps with the integration
of the separation workflow in a broader, extraction-oriented system owing to its
capability of separating two consecutive identical forms.
Table 8.6 shows detailed results for the final deployment model. For each doc-
ument type, the table shows the absolute counts of the results, precision, recall,
and F1-value in two different scenarios. The first six columns show results on the
page level: for each page, the predicted document type is compared with the true
document type, and the results are calculated from these comparisons. The last six columns show values
on the sequence level, taking into account full documents rather than pages. Each
document (i.e., the sequence of pages from start page to end page) is compared with
a gold standard; if both the extent of the document and its type match, the docu-
ment is counted as correct. If either the document type or the pages contained in the
document do not match, the document is counted as incorrect. These measures are
much stricter than the page-level measures, as can be seen from the micro- and
macro-averages. Note that Table 8.6 reports on an experiment with 30 document
types; however, the method scales well, and we achieve similar results with much
larger numbers of categories.
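The sequence-level scoring described above can be sketched as follows, with each document represented as a (type, start_page, end_page) tuple; the representation is chosen for illustration:

```python
def sequence_level_scores(pred_docs, gold_docs):
    """Precision, recall, and F1 at the sequence level: a predicted
    document counts as correct only if both its page span and its
    type exactly match a gold-standard document.
    """
    gold = set(gold_docs)
    tp = sum(1 for d in pred_docs if d in gold)
    precision = tp / len(pred_docs) if pred_docs else 0.0
    recall = tp / len(gold_docs) if gold_docs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because a single mispredicted page invalidates the whole document span, these scores are necessarily lower than their page-level counterparts.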
 