Information Technology Reference
In-Depth Information
8.4 Document Separation as a Sequence Mapping
Problem
Automatic Document Separation adds two pieces of information to a stream of un-
labeled pages. It inserts boundaries, so that documents are kept together, and it
assigns labels to those documents that indicate their type. The problem can be
seen as the mapping of an input sequence to an output sequence, i.e., a sequence of
scanned paper pages is mapped to a sequence of document types. The mapping of
sequences is a well known problem in Computer Science and there exist many differ-
ent applications. For example, compilers, speech recognition, information extraction,
and machine translation are all instances that have some aspect that deals with the
problem of sequence mapping: A sequence of human readable program statements
to a sequence of machine code, a sequence of acoustic signals to a sequence of words,
a sequence of words to a sequence of tags, and a sequence of, e.g., Spanish words to
a sequence of French words.
In addition, probabilistic models are often employed in order to determine the
probabilities of possible output sequences given a particular input sequence. In this
chapter, we utilize these concepts to solve the problem of document separation:
Map a given sequence of pages to all possible output sequences, i.e., sequences of
document types, determine for each output sequence its probability given the input
sequence, find the most likely output sequence (sequence of document types), and,
thus, effectively separate the sequence of input pages by document type.
8.4.1 Sequence Model
Formally, the procedure described above can be modeled as a Markov chain. De-
noting the input sequence of ordered pages 6 p c by P =( p c
1
,...,p n ) and the output
sequence of document types by D =( d 1 ,...,d n ), the probability of a specific se-
quence of document types D given the input sequence of pages can be written as
n
p ( D|P )=
p ( d j |D j− 1 , P ) ,
(8.1)
j =1
where d j denotes the document type of the j -th page and D j− 1 the output sequence
of document types up to the ( j − 1)-th page. In many practical applications, inde-
pendence assumptions regarding the different events d j , D j− 1 , and P hold at some
level of accuracy and allow estimations of the probability p ( D|P ) that are e cient
yet accurate enough for the given purpose.
We started by assuming that the document type d j at time step j only depends
on the page content p j at time step j and gradually increased the complexity of
the models by taking into account the document types of previous time steps. In
particular, we considered
n
p ( d j |p j )
p ( D|P )
(8.2)
j
=1
as well as the following approximation, which is very common and has been widely
used in several fields, e.g., for information extraction [8],
6 The superscript c indicates that the content of the pages is considered.
Search WWH ::




Custom Search