\[ p(D \mid P) \approx \prod_{j=1}^{n} p(d_j \mid d_{j-1}, p_j) \tag{8.3} \]
and finally
\[ p(D \mid P) \approx \prod_{j=1}^{n} p(d_j \mid d_{j-1}, d_{j-2}, p_j). \tag{8.4} \]
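As a rough illustration of how a factorization like Eq. 8.3 or Eq. 8.4 scores a candidate labeling, the sketch below sums the per-page log-probabilities. The function name `sequence_log_prob`, the callback `cond_prob`, and the use of `None` to pad the history before the first page are all assumptions made for this example; the text does not specify how the conditional probabilities are estimated or how the start of a batch is handled.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

History = Tuple[Optional[str], ...]

def sequence_log_prob(
    doc_types: Sequence[str],
    pages: Sequence[object],
    cond_prob: Callable[[str, History, object], float],
    order: int = 2,
) -> float:
    """Log of prod_j p(d_j | d_{j-1}, ..., d_{j-order}, p_j).

    order=1 corresponds to the form of Eq. 8.3, order=2 to Eq. 8.4.
    cond_prob is a hypothetical stand-in for a trained estimator.
    """
    if len(doc_types) != len(pages):
        raise ValueError("need exactly one document type per page")
    total = 0.0
    for j, (d, page) in enumerate(zip(doc_types, pages)):
        # history = (d_{j-1}, d_{j-2}, ...); None pads positions before
        # the first page of the batch (an assumption of this sketch).
        history = tuple(
            doc_types[j - k] if j - k >= 0 else None
            for k in range(1, order + 1)
        )
        total += math.log(cond_prob(d, history, page))
    return total
```

Working in log space avoids numerical underflow when the batch contains many pages. Note that `math.log` raises an error if the estimator returns a probability of zero, so a real estimator would need some form of smoothing.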
Instead of trying to approximate the probability p(D|P) ever more accurately
by relaxing the independence assumptions, one can also describe pages in more detail
by breaking up the document types based on the page position within a document.
Functionally, this is achieved by altering the output language. In the extreme, this
would lead to a model of the data in which the symbols of the output language
are different for each page number within the document. You would have symbols
like TaxForm_1, TaxForm_2, TaxForm_3, etc., for the different page numbers within a
tax form. Here, we increased the alphabet of the original output language threefold:
every document type symbol is split into three symbols, the start, middle, and end
page of the document type. In our experience, forms often have distinctive first and
last pages, e.g., forms starting with pages identifying the form and ending with
signature pages, whereas middle pages of forms do not contain as much discriminating
information. Accordingly, the sequences of the new output language are now
sequences of the type D, where D is given by D = (d_1, ..., d_n) with d_j denoting
the document type as well as the page type. The definitions of the page type events
{start, middle, end} are:
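\[
\begin{aligned}
\text{start}:  &\quad \{p_{j,t} \mid t = 1,\ t \le l\} \\
\text{middle}: &\quad \{p_{j,t} \mid t > 1,\ t < l\} \\
\text{end}:    &\quad \{p_{j,t} \mid t > 1,\ t = l\}
\end{aligned}
\tag{8.5}
\]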
where j is the global page number within the batch and t is the local page number
within a document of length l .
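The page type definitions above translate directly into a small labeling routine. The sketch below is only illustrative: the function names and the `Type_pagetype` symbol format are assumptions of this example, not notation from the text. Note that under the definitions, a one-page document (t = l = 1) is labeled as a start page.

```python
from typing import List

def page_type(t: int, l: int) -> str:
    """Page type of local page t (1-based) in a document of length l."""
    if not 1 <= t <= l:
        raise ValueError("t must satisfy 1 <= t <= l")
    if t == 1:
        return "start"   # t = 1, t <= l (includes one-page documents)
    if t < l:
        return "middle"  # t > 1, t < l
    return "end"         # t > 1, t = l

def label_document(doc_type: str, length: int) -> List[str]:
    """Output symbols for every page of one document (hypothetical format)."""
    return [f"{doc_type}_{page_type(t, length)}" for t in range(1, length + 1)]
```

For example, `label_document("TaxForm", 3)` yields `["TaxForm_start", "TaxForm_middle", "TaxForm_end"]`, and a two-page document has a start page followed directly by an end page.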
One of the models considered using the new output language is
\[ p(D \mid P) \approx \prod_{j=1}^{n} p(d_j \mid p_j), \tag{8.6} \]
under the constraint that the sequence of page types is consistent with the definitions
given by Eq. 8.5, e.g., every document has to end with the end page type with the
exception of one-page documents. Owing to this constraint, the model of Eq. 8.6
has many similarities with the model given by Eq. 8.4. The main difference between the
two models is that the model of Eq. 8.4 determines boundaries between documents
based on the previous document types, whereas the model of Eq. 8.6 relies mainly on
the differences between start, middle, and end pages within the document type to identify
boundaries. Accordingly, the model of Eq. 8.6 can separate subsequent instances of
the same document type, whereas the model of Eq. 8.4 cannot.
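One way the consistency constraint on Eq. 8.6 might be enforced is a dynamic-programming search that only extends sequences whose page types obey Eq. 8.5. The text does not describe the search procedure actually used, so the sketch below is an assumption throughout: the names `allowed` and `decode`, the `(document type, page type)` symbol encoding, and the input format (one dictionary of probabilities p(d_j | p_j) per page) are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Symbol = Tuple[str, str]  # (document type, page type)

def allowed(prev: Symbol, cur: Symbol) -> bool:
    """Transition rule implied by the page type definitions (our reading)."""
    (ptype, ppage), (ctype, cpage) = prev, cur
    if cpage == "start":
        # A new document may begin only after a complete one: after an
        # end page, or after a start page (a one-page document).
        return ppage in ("start", "end")
    # middle and end pages continue the same, still open, document.
    return ctype == ptype and ppage in ("start", "middle")

def decode(page_probs: List[Dict[Symbol, float]]) -> List[Symbol]:
    """Most probable consistent sequence under prod_j p(d_j | p_j)."""
    if not page_probs:
        return []
    # best[s] = (log score, path) of the best consistent prefix ending in s.
    # A batch has to open with a start page.
    best = {s: (math.log(p), [s]) for s, p in page_probs[0].items()
            if s[1] == "start" and p > 0}
    for probs in page_probs[1:]:
        nxt = {}
        for cur, p in probs.items():
            if p <= 0:
                continue
            cands = [(score + math.log(p), path + [cur])
                     for prev, (score, path) in best.items()
                     if allowed(prev, cur)]
            if cands:
                nxt[cur] = max(cands, key=lambda c: c[0])
        best = nxt
    # The batch has to close a document: an end page or a one-page document.
    finals = [v for s, v in best.items() if s[1] in ("start", "end")]
    if not finals:
        raise ValueError("no consistent sequence exists for these inputs")
    return max(finals, key=lambda c: c[0])[1]
```

Because a start page may directly follow an end page of the same document type, this search can place a boundary between two consecutive documents of the same type, which is the property the text attributes to the model of Eq. 8.6.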
Finally, we also tested models that conditioned the output symbol at a given
time step not only on the content of the current page but also on the previous and
the next page