Table 8.5. Comparison of separation and classification results of the various sequence models.

    Sequence model                                     Equation   Micro-averaged F1-value
    p(d_j | p_j)                                       Eq. 8.2    0.63
    p(d_j | d_{j-1}, p_j)                              Eq. 8.3    0.74
    p(d_j | d_{j-1}, d_{j-2}, p_j)                     Eq. 8.4    0.83
    p(d_j | p_j)                                       Eq. 8.6    0.84
    p(d_j | d_{j-1}, d_{j-2}, p_{j-1}, p_j, p_{j+1})   Eq. 8.7    0.86
    p(d_j | p_{j-1}, p_j, p_{j+1})                     Eq. 8.8    0.87
The inclusion of a history of document types improves performance. This is not
surprising, given the fact that forms are, on average, longer than one page. For
instance, using a trigram model instead of a unigram yields an improvement of
31%.
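As a rough illustration of how a history of previous decisions can enter the model at decoding time, the following is a minimal greedy decoder that conditions each page's decision on the two previously predicted document types, as in the trigram model of Eq. 8.4. The `score` function, the label set, and the greedy strategy are assumptions for illustration, not the authors' implementation:

```python
def decode(pages, labels, score):
    """Greedily assign a document type to each page, conditioning on
    the two previously predicted types (a trigram history).

    `score(page, prev1, prev2, label)` is an assumed classifier that
    returns a probability-like value for p(d_j | d_{j-1}, d_{j-2}, p_j).
    """
    history = [None, None]  # d_{j-1} and d_{j-2} for the first page
    decoded = []
    for page in pages:
        best = max(labels,
                   key=lambda d: score(page, history[-1], history[-2], d))
        decoded.append(best)
        history.append(best)
    return decoded
```

A full system would more likely use Viterbi-style decoding over the whole page sequence; the greedy loop above only shows where the history enters the scoring.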
Specializing page descriptions improves performance. This confirms our earlier
reasoning that forms often exhibit specific start and end pages. It also allows
the model to separate two consecutive instances of the same document type.
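The separation of two consecutive instances of the same type can be sketched as a post-processing step over specialized page labels. Here each page is assumed to carry a label such as `invoice-start` or `invoice-rest` (a hypothetical naming convention): a start label always opens a new document, so two back-to-back forms of the same type are still split correctly:

```python
def segment(subtypes):
    """Turn a per-page sequence of specialized labels like 'invoice-start'
    or 'invoice-rest' into (document_type, start_index, end_index) spans.

    A '-start' label always opens a new document, which is what allows
    two consecutive instances of the same form to be separated.
    """
    docs = []
    for i, label in enumerate(subtypes):
        doc_type, _, part = label.partition("-")
        if part == "start" or not docs or docs[-1][0] != doc_type:
            docs.append([doc_type, i, i])   # open a new document at page i
        else:
            docs[-1][2] = i                 # extend the current document
    return [tuple(d) for d in docs]
```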
Conditioning on the content of surrounding pages improves performance. Com-
paring the last two rows in Table 8.5 with their counterparts without the content
of the surrounding pages in the condition indicates a boost of around 3.5% in
F1-value.
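One way to realize such conditioning is to pool features from a three-page window, keeping the three sources apart with prefixes; this is also where the roughly threefold increase in feature count discussed next comes from. A minimal sketch, with whitespace tokenization as a simplifying assumption:

```python
def window_features(page_texts, j):
    """Bag-of-words features for page j combined with the content of its
    two neighbours (as in Eq. 8.8); prefixes keep the sources apart.
    """
    feats = {}
    for offset, prefix in ((-1, "prev:"), (0, "cur:"), (1, "next:")):
        k = j + offset
        if 0 <= k < len(page_texts):      # first/last page have one neighbour
            for tok in page_texts[k].split():
                feats[prefix + tok] = feats.get(prefix + tok, 0) + 1
    return feats
```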
The last model (Eq. 8.8) is the best model in our experiments. However, it
presents a serious drawback in that it uses roughly three times the number of fea-
tures to describe a page (namely, the content of the page itself and that of the two
surrounding pages). Given the increased CPU and memory usage during training,
this seemed too high a price to pay for a 3% gain in performance. Thus, for deploy-
ment into customer production systems, we decided to use the model according to
Eq. 8.6. It is the best of the one-page-content models, and the distinction of page
types not only makes the model more efficient, but also helps with the integration
of the separation workflow in a broader, extraction-oriented system owing to its
capability of separating two consecutive identical forms.
Table 8.6 shows detailed results for the final deployment model. For each doc-
ument type, the table shows the absolute counts of the results, precision, recall,
and F1-value in two different scenarios. The first six columns show results on the
page level: for each page, the predicted document type is compared with the true
document type, and the results are calculated from these comparisons. The last six columns show values
on the sequence level, taking into account full documents rather than pages. Each
document (i.e., the sequence of pages from start page to end page) is compared with
a gold standard; if both the extent of the document and its type match, the docu-
ment is counted as correct. If either the document type or the pages contained in the
document do not match, the document is counted as incorrect. These measures are
much stricter than the page-level measures, as can be seen from the micro- and
macro-averages. Note that Table 8.6 reports on an experiment with 30 document
types; however, the method scales well, and we achieve similar results with much
larger numbers of categories.
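The sequence-level scoring described above can be sketched as follows, with each document represented as a (type, start_page, end_page) tuple; the representation is chosen for illustration:

```python
def sequence_level_scores(pred_docs, gold_docs):
    """Precision, recall, and F1 at the sequence level: a predicted
    document counts as correct only if both its page span and its
    type exactly match a gold-standard document.
    """
    gold = set(gold_docs)
    tp = sum(1 for d in pred_docs if d in gold)
    precision = tp / len(pred_docs) if pred_docs else 0.0
    recall = tp / len(gold_docs) if gold_docs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because a single mispredicted page invalidates the whole document span, these scores are necessarily lower than their page-level counterparts.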
 