Automatic Document Separation: A Combination of Probabilistic Classification and Finite-State Sequence Modeling - Natural Language Processing and Text Mining


\[
p(D \mid P) \approx \prod_{j=1}^{n} p(d_j \mid d_{j-1}, p_j) \tag{8.3}
\]

and finally

\[
p(D \mid P) \approx \prod_{j=1}^{n} p(d_j \mid d_{j-1}, d_{j-2}, p_j). \tag{8.4}
\]
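To make the factorization concrete, the following minimal sketch evaluates the product of Eq. 8.3 in log space for a candidate sequence of document types. The function `cond_prob` is a hypothetical stand-in for whatever estimator supplies p(d_j | d_{j-1}, p_j), and the `start_symbol` used as the predecessor of the first page is an assumption of this sketch, not something the text specifies.

```python
import math
from typing import Callable, Sequence


def sequence_log_prob(
    doc_types: Sequence[str],
    pages: Sequence[object],
    cond_prob: Callable[[str, str, object], float],
    start_symbol: str = "<s>",  # assumed placeholder for the undefined d_0
) -> float:
    """Return sum_j log p(d_j | d_{j-1}, p_j), the log of the product in Eq. 8.3."""
    if len(doc_types) != len(pages):
        raise ValueError("need exactly one document type per page")
    total = 0.0
    prev = start_symbol
    for d, page in zip(doc_types, pages):
        # cond_prob(d, prev, page) is the hypothetical estimate of p(d | prev, page)
        total += math.log(cond_prob(d, prev, page))
        prev = d
    return total
```

Extending the sketch to Eq. 8.4 would only require carrying the two previous symbols instead of one.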

Instead of approximating $p(D \mid P)$ ever more accurately by relaxing the independence assumptions, one can also describe pages in more detail by splitting each document type according to the position of the page within a document. Functionally, this is achieved by altering the output language. In the extreme, this would lead to a model of the data in which the symbols of the output language are different for each page number within the document. You would have symbols like TaxForm$_1$, TaxForm$_2$, TaxForm$_3$, etc., for the different page numbers within a tax form. Here, we increased the alphabet of the original output language threefold. Every document type symbol is split into three symbols: the start, middle, and end page of the document type. In our experience, forms often have distinctive first and last pages, e.g., forms ending with signature pages and starting with pages identifying the form, whereas middle pages of forms do not contain as much discriminating information. Accordingly, the sequences of the new output language are now sequences of the type $D$, where $D$ is given by $D = (d_1, \ldots, d_n)$ with $d_j$ denoting the document type as well as the page type. The definitions of the page type events {start, middle, end} are:

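\[
\begin{aligned}
\text{start} &: \{\, p_{j,t} \mid t = 1,\ t \le l \,\} \\
\text{middle} &: \{\, p_{j,t} \mid t > 1,\ t < l \,\} \\
\text{end} &: \{\, p_{j,t} \mid t > 1,\ t = l \,\}
\end{aligned}
\tag{8.5}
\]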

where $j$ is the global page number within the batch and $t$ is the local page number within a document of length $l$.
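The page type definitions of Eq. 8.5 can be sketched as a small labeling routine. The symbol naming scheme `Type_pagetype` and the function names `page_type` and `label_batch` are illustrative assumptions, not part of the original text. Note that, following Eq. 8.5, a one-page document consists of a single start page.

```python
from typing import List, Sequence, Tuple


def page_type(t: int, l: int) -> str:
    """Page type of local page t in a document of length l, following Eq. 8.5."""
    if not 1 <= t <= l:
        raise ValueError("local page number t must satisfy 1 <= t <= l")
    if t == 1:
        return "start"   # t = 1, t <= l (covers one-page documents)
    if t == l:
        return "end"     # t > 1, t = l
    return "middle"      # t > 1, t < l


def label_batch(documents: Sequence[Tuple[str, int]]) -> List[str]:
    """Expand (document type, length) pairs into one output symbol per page."""
    labels = []
    for doc_type, length in documents:
        for t in range(1, length + 1):
            labels.append(f"{doc_type}_{page_type(t, length)}")
    return labels
```

For example, a batch with a three-page tax form followed by a one-page letter yields the symbols `TaxForm_start`, `TaxForm_middle`, `TaxForm_end`, `Letter_start`.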

One of the models considered using the new output language is

\[
p(D \mid P) \approx \prod_{j=1}^{n} p(d_j \mid p_j), \tag{8.6}
\]

under the constraint that the sequence of page types is consistent with the definitions given by Eq. 8.5, e.g., every document has to end with the end page type, with the exception of one-page documents. Owing to this constraint, the model of Eq. 8.6 has many similarities with the model given by Eq. 8.4. The main difference between the two models is that the model of Eq. 8.4 determines boundaries between documents based on the previous document types, whereas the model of Eq. 8.6 relies mainly on the difference between start, middle, and end pages within the document type to identify boundaries. Accordingly, the model of Eq. 8.6 can separate subsequent instances of the same document type, whereas the model of Eq. 8.4 cannot.
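One way to picture the consistency constraint on Eq. 8.6 is as a best-path search in which only page type transitions permitted by Eq. 8.5 are allowed. The sketch below uses a Viterbi-style dynamic program. It is an illustration under stated assumptions, not the implementation described in the text: the per-page probabilities `page_probs` stand in for classifier outputs p(d_j | p_j), and the `Type_pagetype` symbol names and the functions `allowed` and `decode` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

NEG_INF = float("-inf")


def allowed(prev: str, cur: str) -> bool:
    """True if symbol cur may directly follow symbol prev under Eq. 8.5."""
    prev_doc, prev_pt = prev.rsplit("_", 1)
    cur_doc, cur_pt = cur.rsplit("_", 1)
    if cur_pt == "start":
        # A new document begins after an end page or after a one-page document.
        return prev_pt in ("start", "end")
    # Middle and end pages continue the same document after a start or middle page.
    return cur_doc == prev_doc and prev_pt in ("start", "middle")


def decode(page_probs: Sequence[Dict[str, float]]) -> List[str]:
    """Most probable symbol sequence under Eq. 8.6 that is consistent with Eq. 8.5."""
    symbols = sorted(page_probs[0])

    def logp(j: int, s: str) -> float:
        p = page_probs[j].get(s, 0.0)
        return math.log(p) if p > 0 else NEG_INF

    # The first page of a batch must start a document.
    score = {s: logp(0, s) if s.endswith("_start") else NEG_INF for s in symbols}
    back = []
    for j in range(1, len(page_probs)):
        new_score, pointers = {}, {}
        for cur in symbols:
            best_prev, best = None, NEG_INF
            for prev in symbols:
                if allowed(prev, cur) and score[prev] > best:
                    best_prev, best = prev, score[prev]
            new_score[cur] = best + logp(j, cur) if best_prev is not None else NEG_INF
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # The last page must close a document: an end page or a one-page document.
    finals = [s for s in symbols if not s.endswith("_middle") and score[s] > NEG_INF]
    if not finals:
        raise ValueError("no symbol sequence is consistent with the constraint")
    path = [max(finals, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

Because a start symbol may follow an end symbol of the same document type, this search can place a boundary between two consecutive documents of one type, which is the property the text attributes to the model of Eq. 8.6.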

Finally, we also tested models that conditioned the output symbol at a given time step not only on the content of the current page but also on the previous and the next page
