\[
p(D \mid P) \approx \prod_{j=1}^{n} p(d_j \mid d_{j-1}, d_{j-2}, p_{j-1}, p_j, p_{j+1}) \tag{8.7}
\]
\[
p(D \mid P) \approx \prod_{j=1}^{n} p(d_j \mid p_{j-1}, p_j, p_{j+1}), \tag{8.8}
\]
where the model given by Eq. 8.8 has the same constrained output language as
the model of Eq. 8.6, i.e., an output language consistent with the definitions of the
events {start, middle, end} given by Eq. 8.5.
8.4.2 Sequence Model Estimation
The problem of determining the different sequence models introduced in the previous
section amounts to estimating a probability of the form p(x | p_c, y), with, e.g., x denoting
a document type and y a history of document types. As outlined in Section 8.3, a
bag-of-words model is used for the page content^7 p_c, i.e., p_c = {(c_1, w_1), ..., (c_n, w_n)},
with c_j denoting the number of occurrences of word w_j on the page, yielding
\[
p(x \mid p_c, y) = \frac{p(p_c \mid x, y)\, p(x, y)}{p(p_c, y)} \propto \prod_{j=1}^{n} p(w_j \mid x, y)^{c_j}\, p(x, y), \tag{8.9}
\]
whereby in the last step the constant factor 1/p(p_c, y) has been omitted. As can be
seen from Eq. 8.9, the sequence model estimation is reduced to the determination of
the probabilities p(w_j | x, y) and p(x, y). These probabilities are estimated empirically
from sample documents (training examples) for the various events (x, y).
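To make the use of Eq. 8.9 concrete, each candidate event (x, y) can be scored in log space with its estimated probabilities and the best-scoring event selected. The following Python sketch assumes dictionary-based models and whitespace tokenization; the function names, the fallback probability for words without an estimate, and the data layout are illustrative assumptions rather than details of the system described here.

```python
import math
from collections import Counter

def log_score(page_text, word_probs, prior, unseen_prob=1e-9):
    """Unnormalized log form of Eq. 8.9 for one event (x, y):
    sum_j c_j * log p(w_j | x, y) + log p(x, y)."""
    counts = Counter(page_text.split())  # bag of words: word w_j -> count c_j
    score = math.log(prior)              # log p(x, y)
    for word, c in counts.items():
        # fall back to a small probability for words without an estimate (assumption)
        score += c * math.log(word_probs.get(word, unseen_prob))
    return score

def classify(page_text, models):
    """Return the event (x, y) whose model yields the highest score;
    `models` maps each event to a pair (word_probs, prior)."""
    return max(models, key=lambda xy: log_score(page_text, *models[xy]))
```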
For a typical training corpus, provided by the customer, the statistics available for
determining the word probabilities p(w | x, y) are very sparse.^8 Given such statistics,
overfitting to the training data is a common problem. Smoothing techniques, like those
developed for language modeling [9], are a common tool for addressing sparse statistics
by reserving some probability mass for unobserved events. When determining the
conditional word probabilities p(w | x, y), words that have been observed in the training
data would be assigned lower probabilities than their maximum likelihood estimates,
whereas unobserved words would be assigned higher probabilities than their maximum
likelihood estimates. Statistical learning methods, e.g., [10, 11], which make use of
regularization theory, allow the tradeoff between memorization and generalization to
be determined in a more principled way than with the smoothing techniques mentioned
above. The learning method adopted here for estimating the sequence model is a
Support Vector Machine (SVM) [10]. Support Vector Machines are known to be well
suited for text applications even with a small number of training examples [12]. This
is an important aspect for the commercial use of the system, since the process of
gathering, preparing, and cleaning up training examples is time consuming and expensive.
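As a minimal illustration of the smoothing idea (not the technique of [9], and not the SVM-based estimation actually adopted here), the following Python sketch applies additive smoothing to the word counts gathered for one event (x, y); the parameter alpha and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def smoothed_word_probs(training_pages, vocabulary, alpha=1.0):
    """Additive (add-alpha) smoothing of p(w | x, y), estimated from the
    training pages observed for one event (x, y). When the vocabulary is
    large compared to the observed counts, observed words end up slightly
    below their maximum likelihood estimates and unobserved words receive
    a small positive probability instead of zero."""
    counts = Counter()
    for page in training_pages:
        counts.update(page.split())  # whitespace tokenization (assumption)
    total = sum(counts.values())
    denominator = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denominator for w in vocabulary}
```

Because the pseudo-count alpha is added for every vocabulary word, probability mass is shifted from frequently observed words towards rare and unobserved ones, which is the redistribution described above; the regularized learning approach adopted in the system controls this tradeoff in a more principled way.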
7 Here, p_c indicates both content models we are considering: the page content at
a given time step as well as the content of the pages p_{j-1}, p_j, p_{j+1} at a time step j.
8 For a typical training corpus, almost all words occur rarely, with word counts of one or two.
 