Automatic Document Separation: A Combination of Probabilistic Classification and Finite-State Sequence Modeling - Natural Language Processing and Text Mining

Information Technology Reference

In-Depth Information

8.4 Document Separation as a Sequence Mapping

Problem

Automatic Document Separation adds two pieces of information to a stream of un-

labeled pages. It inserts boundaries, so that documents are kept together, and it

assigns labels to those documents that indicate their type. The problem can be

seen as the mapping of an input sequence to an output sequence, i.e., a sequence of

scanned paper pages is mapped to a sequence of document types. The mapping of

sequences is a well known problem in Computer Science and there exist many differ-

ent applications. For example, compilers, speech recognition, information extraction,

and machine translation are all instances that have some aspect that deals with the

problem of sequence mapping: A sequence of human readable program statements

to a sequence of machine code, a sequence of acoustic signals to a sequence of words,

a sequence of words to a sequence of tags, and a sequence of, e.g., Spanish words to

a sequence of French words.

In addition, probabilistic models are often employed in order to determine the

probabilities of possible output sequences given a particular input sequence. In this

chapter, we utilize these concepts to solve the problem of document separation:

Map a given sequence of pages to all possible output sequences, i.e., sequences of

document types, determine for each output sequence its probability given the input

sequence, find the most likely output sequence (sequence of document types), and,

thus, effectively separate the sequence of input pages by document type.

8.4.1 Sequence Model

Formally, the procedure described above can be modeled as a Markov chain. De-

noting the input sequence of ordered pages 6 p c by P =( p c

1

,...,p n ) and the output

sequence of document types by D =( d 1 ,...,d n ), the probability of a specific se-

quence of document types D given the input sequence of pages can be written as

n

p ( D|P )=

p ( d j |D j− 1 , P ) ,

(8.1)

j =1

where d j denotes the document type of the j -th page and D j− 1 the output sequence

of document types up to the ( j − 1)-th page. In many practical applications, inde-

pendence assumptions regarding the different events d j , D j− 1 , and P hold at some

level of accuracy and allow estimations of the probability p ( D|P ) that are e cient

yet accurate enough for the given purpose.

We started by assuming that the document type d j at time step j only depends

on the page content p j at time step j and gradually increased the complexity of

the models by taking into account the document types of previous time steps. In

particular, we considered

n

p ( d j |p j )

p ( D|P ) ≈

(8.2)

j

=1

as well as the following approximation, which is very common and has been widely

used in several fields, e.g., for information extraction [8],

6 The superscript c indicates that the content of the pages is considered.

Natural Language Processing and Text Mining

Search WWH ::

Custom Search

Home